GenIR: Generative Visual Feedback for Mental Image Retrieval

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Mental Image Retrieval (MIR) under multi-turn user interaction remains challenging: existing vision-language models (VLMs) rely on abstract textual feedback, which is ill-suited for iteratively refining vague, subjective mental imagery. Method: We propose a generative visual feedback paradigm, leveraging diffusion models to synthesize interpretable, controllable conditional images as direct, per-turn feedback. We formally define the MIR task, introduce the first image-generation-based interactive mechanism, and construct, via a fully automated pipeline, the first high-quality multi-turn MIR dataset. Our approach integrates vision-language alignment fine-tuning, a multi-turn retrieval framework, and an automated data synthesis pipeline. Contribution/Results: On the MIR benchmark, our method significantly outperforms prior interactive methods. Empirical results demonstrate that visual feedback substantially improves both retrieval accuracy and user controllability compared to textual feedback.

📝 Abstract
Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.
Problem

Research questions and friction points this paper is trying to address.

Bridging vision-language models to real-world mental image retrieval
Providing clear visual feedback for multi-round search refinement
Automating dataset creation for interactive mental image retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative multi-round retrieval paradigm
Diffusion-based image generation feedback
Automated multi-round MIR dataset pipeline
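The multi-round loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `embed` stands in for a VLM encoder, and `generate_feedback` stands in for the diffusion model that reifies the system's current understanding as an image each round. All function names and the bag-of-words embedding are hypothetical.

```python
# Hypothetical sketch of GenIR-style multi-round retrieval with generative
# feedback. Toy components replace the real VLM and diffusion model.
from math import sqrt

def embed(text):
    # Toy bag-of-words vector standing in for a VLM text/image encoder.
    vocab = ["red", "car", "beach", "sunset", "dog"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus):
    # Return the best-matching item (captions stand in for images).
    q = embed(query)
    return max(corpus, key=lambda caption: cosine(q, embed(caption)))

def generate_feedback(query):
    # Stands in for the diffusion model: synthesize an image that makes
    # the system's current understanding of the query explicit.
    return f"[synthesized image depicting: {query}]"

def mental_image_retrieval(initial_query, corrections, corpus):
    query = initial_query
    for correction in corrections:
        _feedback = generate_feedback(query)  # visual feedback shown to user
        query = f"{query} {correction}"       # user refines after seeing it
    return retrieve(query, corpus)
```

A usage example: starting from the vague query `"beach"`, one round of refinement (`"sunset"`) after seeing the generated feedback steers retrieval toward the intended target, `"sunset over beach"`.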
Diji Yang
University of California, Santa Cruz
Natural Language Processing · Information Retrieval
Minghao Liu
University of California, Santa Cruz
Chung-Hsiang Lo
Northeastern University
Yi Zhang
University of California, Santa Cruz
James Davis
University of California, Santa Cruz