Eliminating Hallucination in Diffusion-Augmented Interactive Text-to-Image Retrieval

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the degradation of retrieval performance in interactive text-to-image search caused by hallucinated visual cues in images generated by diffusion models, which often conflict with user intent. To mitigate this issue, the authors propose Diffusion-aware Multi-view Contrastive Learning (DMCL), a framework that jointly optimizes user intent and target image representations by aligning textual queries with diffusion-generated views while suppressing hallucinatory signals. DMCL integrates semantic consistency constraints with a diffusion-aware contrastive objective, enabling the encoder to map hallucinated features into the null space of the embedding space—effectively acting as a semantic filter that enhances robustness against spurious cues. Evaluated on five standard benchmarks, DMCL significantly improves multi-turn Hits@10, outperforming existing fine-tuned and zero-shot baselines by up to 7.37%.

📝 Abstract
Diffusion-Augmented Interactive Text-to-Image Retrieval (DAI-TIR) is a promising paradigm that improves retrieval performance by generating query images via diffusion models and using them as additional "views" of the user's intent. However, these generative views can be incorrect because diffusion generation may introduce hallucinated visual cues that conflict with the original query text. Indeed, we empirically demonstrate that these hallucinated cues can substantially degrade DAI-TIR performance. To address this, we propose Diffusion-aware Multi-view Contrastive Learning (DMCL), a hallucination-robust training framework that casts DAI-TIR as joint optimization over representations of query intent and the target image. DMCL introduces semantic-consistency and diffusion-aware contrastive objectives to align textual and diffusion-generated query views while suppressing hallucinated query signals. This yields an encoder that acts as a semantic filter, effectively mapping hallucinated cues into a null space, improving robustness to spurious cues and better representing the user's intent. Attention visualization and geometric embedding-space analyses corroborate this filtering behavior. Across five standard benchmarks, DMCL delivers consistent improvements in multi-round Hits@10, reaching as high as 7.37% over prior fine-tuned and zero-shot baselines, which indicates it is a general and robust training framework for DAI-TIR.
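The abstract does not spell out DMCL's exact objective, but its core idea, aligning a textual query with a diffusion-generated view while down-weighting hallucinated signals before a contrastive retrieval loss, can be sketched as follows. The cosine-agreement gate used here to suppress hallucinated views is an illustrative assumption, not the paper's actual mechanism, and the InfoNCE loss stands in for the diffusion-aware contrastive objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    """Unit-normalize embeddings along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fused_query(text_emb, gen_emb):
    """Fuse a textual query with a diffusion-generated view.

    The generated view is gated by its cosine agreement with the text,
    so views dominated by hallucinated cues (low agreement with the
    query text) contribute less. This gating is a stand-in for DMCL's
    semantic filtering, chosen only for illustration.
    """
    w = max(0.0, float(text_emb @ gen_emb))  # agreement gate in [0, 1]
    return l2norm(text_emb + w * gen_emb)

def contrastive_loss(query, target, negatives, tau=0.07):
    """InfoNCE: pull the fused query toward the target image embedding
    and push it away from negative image embeddings."""
    logits = np.concatenate(([query @ target], negatives @ query)) / tau
    logits = logits - logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

# Toy unit-normalized embeddings (16-dim for illustration).
text = l2norm(rng.normal(size=16))
target = l2norm(text + 0.1 * rng.normal(size=16))        # true target image
faithful = l2norm(text + 0.1 * rng.normal(size=16))      # faithful generated view
hallucinated = l2norm(rng.normal(size=16))               # off-intent generated view
negatives = l2norm(rng.normal(size=(32, 16)))            # negative images

loss_faithful = contrastive_loss(fused_query(text, faithful), target, negatives)
loss_halluc = contrastive_loss(fused_query(text, hallucinated), target, negatives)
```

A faithful view passes through the gate nearly unchanged, while a view that disagrees with the query text is attenuated before it can pull the fused query away from the target.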
Problem

Research questions and friction points this paper is trying to address.

hallucination
diffusion models
text-to-image retrieval
interactive retrieval
visual cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-aware Multi-view Contrastive Learning
Hallucination suppression
Text-to-image retrieval
Semantic filtering
Diffusion models
Zhuocheng Zhang
Institute of Computing Technology, Chinese Academy of Sciences
Natural Language Processing
Kangheng Liang
University of Glasgow
Guanxuan Li
Hunan University
Paul Henderson
University of Glasgow
computer vision, machine learning
R. McCreadie
University of Glasgow
Zijun Long
Hunan University