WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a central challenge in zero-shot composed image retrieval (ZS-CIR): preserving the fine-grained details of the reference image while accurately reflecting the textual modification, without any training. It proposes WISER, a training-free framework that adaptively fuses text-to-image (T2I) and image-to-image (I2I) retrieval through a "retrieve-verify-refine" pipeline that explicitly models intent awareness and uncertainty awareness. A verifier assesses retrieval confidence: uncertain retrievals trigger a refinement round guided by structured self-reflection, while reliable ones are dynamically fused across the two paths. WISER achieves relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods, and it surpasses many training-dependent methods.

📝 Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and edited images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier that assesses retrieval confidence, triggering refinement for uncertain retrievals and dynamically fusing the two paths for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.
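The control flow of the "retrieve-verify-refine" pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the verifier, the fusion weights, or the refinement step work, so everything below is an assumption made for illustration. In particular, `margin_verifier` (confidence taken as the gap between the top two scores), the confidence-weighted sum in `fuse`, the `threshold` and `max_rounds` parameters, and the `refine` callback are all hypothetical, and plain vectors stand in for the embeddings of the edited caption and edited image.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def score_gallery(query: Vector, gallery: Sequence[Vector]) -> List[float]:
    """Score every gallery item against one query embedding."""
    return [cosine(query, g) for g in gallery]


def margin_verifier(t2i: List[float], i2i: List[float]) -> Tuple[float, float]:
    """Hypothetical verifier: confidence = top-1 minus top-2 score per path."""
    def margin(scores: List[float]) -> float:
        top = sorted(scores, reverse=True)
        return top[0] - top[1] if len(top) > 1 else 1.0
    return margin(t2i), margin(i2i)


def fuse(t2i: List[float], i2i: List[float], w_t: float, w_i: float) -> List[float]:
    """Assumed fusion rule: confidence-weighted average of the two score lists."""
    total = w_t + w_i
    if total == 0:  # no confidence in either path: fall back to equal weights
        w_t, w_i, total = 0.5, 0.5, 1.0
    return [(w_t * t + w_i * i) / total for t, i in zip(t2i, i2i)]


def retrieve_verify_refine(
    text_query: Vector,
    image_query: Vector,
    gallery: Sequence[Vector],
    verify: Callable[[List[float], List[float]], Tuple[float, float]],
    refine: Callable[[Vector, Vector], Tuple[Vector, Vector]],
    threshold: float = 0.1,
    max_rounds: int = 3,
) -> List[int]:
    """Return gallery indices ranked by fused score."""
    for round_idx in range(max_rounds):
        # Wider search: score the gallery along both paths in parallel.
        t2i = score_gallery(text_query, gallery)
        i2i = score_gallery(image_query, gallery)
        # Verify: estimate how reliable each path's retrieval is.
        c_t, c_i = verify(t2i, i2i)
        if max(c_t, c_i) >= threshold or round_idx == max_rounds - 1:
            break
        # Deeper thinking: an uncertain retrieval triggers a refined query.
        text_query, image_query = refine(text_query, image_query)
    # Adaptive fusion: weight each path by its confidence.
    fused = fuse(t2i, i2i, c_t, c_i)
    return sorted(range(len(gallery)), key=lambda k: fused[k], reverse=True)


if __name__ == "__main__":
    gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    ranking = retrieve_verify_refine(
        [0.6, 0.8], [0.8, 0.6], gallery, margin_verifier, lambda t, i: (t, i)
    )
    print(ranking)
```

In this sketch the loop exits early once either path is confident, so refinement only costs extra rounds on uncertain queries; a real system would replace `refine` with a step that regenerates the edited caption and image.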
Problem

Research questions and friction points this paper is trying to address.

Zero-Shot Composed Image Retrieval
Multimodal Query
Text-to-Image Retrieval
Image-to-Image Retrieval
Training-Free
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-Free
Composed Image Retrieval
Adaptive Fusion
Intent Awareness
Uncertainty-Aware Retrieval