STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges in training-free zero-shot composed image retrieval (CIR). First, the semantic gap between the reference image and the modification text leads large language models to generate captions that carry over irrelevant features from the reference image. Second, point-to-point alignment at retrieval time struggles to capture diverse compositions. To overcome these issues, the authors propose STiTch, a training-free framework that combines semantic transition with optimal transport. The method refines the LLM-generated composed caption in the embedding space using a transition vector, so that the caption emphasizes the core modification intent. It also models captions and images as discrete distributions and scores candidates with a bidirectional transportation distance, recasting retrieval as a set-to-set alignment problem. The authors report that the approach is general and effective across CIR tasks.
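The refinement step can be illustrated with a small sketch. The summary does not say how the transition vector is built, so everything here is an assumption: the vector is taken to be the difference between an embedding of the modification intent and an embedding of the reference content, and `alpha`, `refine_caption_embedding`, and the toy 3-dimensional embeddings are hypothetical stand-ins rather than the authors' formulation.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def refine_caption_embedding(caption_emb, source_emb, target_emb, alpha=0.5):
    """Shift a caption embedding along an assumed transition vector.

    Assumption: transition = normalize(target_emb) - normalize(source_emb),
    where source_emb stands for the reference content and target_emb for the
    modification intent. alpha controls the step size.
    """
    c = normalize(caption_emb)
    s = normalize(source_emb)
    t = normalize(target_emb)
    shifted = [ci + alpha * (ti - si) for ci, si, ti in zip(c, s, t)]
    return normalize(shifted)


# Toy embeddings: the caption still looks like the reference content,
# while the intent points in a different direction.
caption = [1.0, 0.0, 0.0]
source = [1.0, 0.0, 0.0]
intent = [0.0, 1.0, 0.0]

refined = refine_caption_embedding(caption, source, intent, alpha=0.5)
print(cosine(caption, intent), cosine(refined, intent))
```

In this toy case the refined embedding moves toward the intent direction. In practice the embeddings would come from a text encoder, and the step size would need tuning.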

📝 Abstract

Training-free zero-shot composed image retrieval (CIR) models have recently attracted growing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploiting the compositional ability of LLMs. Although these methods are efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and the text modification, where the image contains far more detail than the text; and 2) the point-to-point alignment used during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel Semantic Transition and Transportation in Collaboration (STiTch) framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we refine it through a transition vector in the embedding space to bring it closer to the target image. By combining LLMs with the user instruction, the refined caption concentrates on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignments during the retrieval stage, we model the caption and the image as discrete distributions and reformulate retrieval as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to account for fine-grained alignments across modalities and to calculate the retrieval score. Extensive experiments demonstrate that our method is general, effective, and beneficial for many CIR tasks.
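To make the set-to-set scoring idea concrete, here is a minimal sketch. The abstract does not define the bidirectional transportation distance, so this uses a softmax-based forward and backward transport plan over a cosine cost as a stand-in. The function names, the `temperature` parameter, uniform element weights, and the toy sets are all assumptions, not the paper's method.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cost_matrix(text_set, image_set):
    """Pairwise cost 1 - cosine similarity between two sets of embeddings."""
    T = [normalize(t) for t in text_set]
    V = [normalize(v) for v in image_set]
    return [[1.0 - sum(a * b for a, b in zip(t, v)) for v in V] for t in T]


def bidirectional_transport_distance(text_set, image_set, temperature=0.1):
    """Average of a text-to-image and an image-to-text expected transport cost.

    Assumption: each element carries uniform mass and spreads it over the
    other set with softmax(-cost / temperature).
    """
    cost = cost_matrix(text_set, image_set)
    n, m = len(cost), len(cost[0])

    forward = 0.0  # text elements transport to image elements
    for i in range(n):
        plan = softmax([-cost[i][j] / temperature for j in range(m)])
        forward += sum(p * cost[i][j] for j, p in enumerate(plan)) / n

    backward = 0.0  # image elements transport to text elements
    for j in range(m):
        plan = softmax([-cost[i][j] / temperature for i in range(n)])
        backward += sum(p * cost[i][j] for i, p in enumerate(plan)) / m

    return 0.5 * (forward + backward)


def rank_candidates(text_set, candidates, temperature=0.1):
    """Return candidate indices sorted by increasing distance, plus distances."""
    distances = [
        bidirectional_transport_distance(text_set, c, temperature)
        for c in candidates
    ]
    order = sorted(range(len(candidates)), key=lambda k: distances[k])
    return order, distances


# Toy example: caption elements vs. two candidate images' element sets.
caption_set = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matching = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mismatched = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]

order, distances = rank_candidates(caption_set, [matching, mismatched])
print(order, distances)
```

Compared with scoring a single caption vector against a single image vector, a set-level score lets each caption element find its best counterpart among the image elements, and vice versa. That is one plausible reading of how set-to-set alignment could capture more diverse compositions.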

Problem

Research questions and friction points this paper is trying to address.

zero-shot composed image retrieval

semantic gap

point-to-point alignment

multimodal retrieval

compositional understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free

zero-shot composed image retrieval

semantic transition