🤖 AI Summary
Zero-shot composed image retrieval (ZS-CIR) faces three key challenges: ambiguous user intent, entanglement of positive and negative semantics in a single fused query, and the mismatch between benchmarks' single-target assumption and real-world query ambiguity. To address these, we propose SoFT, a training-free soft filtering module that introduces dual-track textual constraint modeling, explicitly distinguishing prescriptive ("must include") from proscriptive ("must avoid") semantics. Leveraging multimodal large language models (MLLMs), SoFT dynamically parses both constraint types from the reference image and modification text to enable zero-shot re-ranking of retrieval results. Additionally, we design a pipeline for generating multi-target CIR benchmarks that support fine-grained, ambiguity-robust evaluation. On CIRR, CIRCO, and FashionIQ, SoFT achieves improvements of +12.94 in R@5, +6.13 in mAP@50, and +4.59 in R@50 respectively, significantly enhancing both the robustness and accuracy of ZS-CIR.
📝 Abstract
Composed Image Retrieval (CIR) aims to find a target image that aligns with user intent expressed through a reference image and a modification text. While zero-shot CIR (ZS-CIR) methods sidestep the need for labeled training data by leveraging pretrained vision-language models, they often rely on a single fused query that merges all descriptive cues of what the user wants, which tends to dilute key information and fails to account for what the user wishes to avoid. Moreover, current CIR benchmarks assume a single correct target per query, overlooking the ambiguity of modification texts. To address these challenges, we propose Soft Filtering with Textual constraints (SoFT), a training-free, plug-and-play filtering module for ZS-CIR. SoFT leverages multimodal large language models (MLLMs) to extract two complementary types of constraints from the reference-modification pair: prescriptive (must-have) and proscriptive (must-avoid). These serve as semantic filters that reward or penalize candidate images to re-rank results, without modifying the base retrieval model or adding supervision. In addition, we construct a two-stage dataset pipeline that refines existing CIR benchmarks. We first identify multiple plausible targets per query to construct multi-target triplets, capturing the open-ended nature of user intent. We then guide MLLMs to rewrite the modification text to focus on a single target, referencing contrastive distractors to ensure precision. This enables more comprehensive and reliable evaluation under varying levels of ambiguity. Applied on top of CIReVL, a ZS-CIR retriever, SoFT raises R@5 to 65.25 on CIRR (+12.94), mAP@50 to 27.93 on CIRCO (+6.13), and R@50 to 58.44 on FashionIQ (+4.59), demonstrating broad effectiveness.
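To make the reward/penalize re-ranking idea concrete, here is a minimal sketch of constraint-based soft filtering. This is not the paper's exact formulation: the function name, the linear scoring rule, and the `reward`/`penalty` weights are illustrative assumptions, and the per-candidate constraint similarities are assumed to come from an upstream image-text model (e.g. CLIP-style scores against the extracted prescriptive and proscriptive constraint texts).

```python
# Illustrative sketch of soft filtering for re-ranking (assumed form, not the
# authors' exact method). Inputs: base retrieval scores, plus each candidate's
# similarity to the prescriptive ("must have") and proscriptive ("must avoid")
# constraint texts, all precomputed upstream.

def soft_filter_rerank(base_scores, prescriptive_sims, proscriptive_sims,
                       reward=0.5, penalty=0.5):
    """Re-rank candidates: reward matches to must-have constraints and
    penalize matches to must-avoid constraints. Weights are illustrative."""
    adjusted = [
        b + reward * p - penalty * n
        for b, p, n in zip(base_scores, prescriptive_sims, proscriptive_sims)
    ]
    # Return candidate indices sorted by adjusted score, best first.
    return sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)

# Toy example: candidate 1 has the best base score but strongly matches a
# proscriptive constraint, so it drops after soft filtering.
ranking = soft_filter_rerank(
    base_scores=[0.80, 0.85, 0.60],
    prescriptive_sims=[0.9, 0.3, 0.5],
    proscriptive_sims=[0.1, 0.8, 0.4],
)
```

Because the module only adjusts scores and re-sorts, it is plug-and-play in the sense the abstract describes: the base retriever's outputs are consumed as-is, with no retraining or supervision.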