🤖 AI Summary
Zero-shot composed image retrieval (ZS-CIR) faces three key challenges: ambiguous user intent, entanglement of positive and negative semantics in a single fused query, and the mismatch between benchmarks' single-target assumption and real-world query ambiguity. To address these, we propose SoFT, a training-free soft filtering module that introduces dual-track textual constraint modeling, explicitly distinguishing prescriptive ("must include") from proscriptive ("must avoid") semantics. Leveraging multimodal large language models (MLLMs), SoFT dynamically parses both constraint types from the reference image and modification text to enable zero-shot re-ranking of retrieval results. Additionally, we design a pipeline for generating multi-target CIR benchmarks that support fine-grained, ambiguity-robust evaluation. On CIRR, CIRCO, and FashionIQ, SoFT achieves improvements of +12.94 in R@5, +6.13 in mAP@50, and +4.59 in R@50 respectively, significantly enhancing both the robustness and accuracy of ZS-CIR.
📝 Abstract
Composed Image Retrieval (CIR) aims to find a target image that aligns with user intent expressed through a reference image and a modification text. While zero-shot CIR (ZS-CIR) methods sidestep the need for labeled training data by leveraging pretrained vision-language models, they often rely on a single fused query that merges all descriptive cues of what the user wants, which tends to dilute key information and fails to account for what the user wishes to avoid. Moreover, current CIR benchmarks assume a single correct target per query, overlooking the ambiguity of modification texts. To address these challenges, we propose Soft Filtering with Textual constraints (SoFT), a training-free, plug-and-play filtering module for ZS-CIR. SoFT leverages multimodal large language models (MLLMs) to extract two complementary types of constraints from the reference-modification pair: prescriptive (must-have) and proscriptive (must-avoid). These serve as semantic filters that reward or penalize candidate images to re-rank results, without modifying the base retrieval model or adding supervision. In addition, we construct a two-stage dataset pipeline that refines existing CIR benchmarks. We first identify multiple plausible targets per query to construct multi-target triplets, capturing the open-ended nature of user intent. We then guide MLLMs to rewrite the modification text to focus on a single target, referencing contrastive distractors to ensure precision. This enables more comprehensive and reliable evaluation under varying levels of ambiguity. Applied on top of CIReVL, a ZS-CIR retriever, SoFT raises R@5 to 65.25 on CIRR (+12.94), mAP@50 to 27.93 on CIRCO (+6.13), and R@50 to 58.44 on FashionIQ (+4.59), demonstrating broad effectiveness.
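To make the reward/penalize re-ranking idea concrete, here is a minimal sketch of constraint-based soft filtering. This is not the paper's exact formulation: the function name, the linear scoring rule, and the `reward`/`penalty` weights are illustrative assumptions, and the per-candidate constraint similarities are assumed to come from an upstream image-text model (e.g. CLIP-style scores against the extracted prescriptive and proscriptive constraint texts).

```python
# Illustrative sketch of soft filtering for re-ranking (assumed form, not the
# authors' exact method). Inputs: base retrieval scores, plus each candidate's
# similarity to the prescriptive ("must have") and proscriptive ("must avoid")
# constraint texts, all precomputed upstream.

def soft_filter_rerank(base_scores, prescriptive_sims, proscriptive_sims,
                       reward=0.5, penalty=0.5):
    """Re-rank candidates: reward matches to must-have constraints and
    penalize matches to must-avoid constraints. Weights are illustrative."""
    adjusted = [
        b + reward * p - penalty * n
        for b, p, n in zip(base_scores, prescriptive_sims, proscriptive_sims)
    ]
    # Return candidate indices sorted by adjusted score, best first.
    return sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)

# Toy example: candidate 1 has the best base score but strongly matches a
# proscriptive constraint, so it drops after soft filtering.
ranking = soft_filter_rerank(
    base_scores=[0.80, 0.85, 0.60],
    prescriptive_sims=[0.9, 0.3, 0.5],
    proscriptive_sims=[0.1, 0.8, 0.4],
)
```

Because the module only adjusts scores and re-sorts, it is plug-and-play in the sense the abstract describes: the base retriever's outputs are consumed as-is, with no retraining or supervision.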