🤖 AI Summary
This work addresses the degradation of retrieval performance in interactive text-to-image search caused by hallucinated visual cues in diffusion-generated images, which often conflict with user intent. To mitigate this, the authors propose Diffusion-aware Multi-view Contrastive Learning (DMCL), a framework that jointly optimizes user-intent and target-image representations by aligning textual queries with diffusion-generated views while suppressing hallucinatory signals. DMCL combines semantic-consistency constraints with a diffusion-aware contrastive objective, training the encoder to map hallucinated features into a null space of the embedding, in effect acting as a semantic filter that improves robustness to spurious cues. Evaluated on five standard benchmarks, DMCL significantly improves multi-turn Hits@10, outperforming existing fine-tuned and zero-shot baselines by up to 7.37%.
📝 Abstract
Diffusion-Augmented Interactive Text-to-Image Retrieval (DAI-TIR) is a promising paradigm that improves retrieval performance by generating query images via diffusion models and using them as additional "views" of the user's intent. However, these generative views can be incorrect: diffusion generation may introduce hallucinated visual cues that conflict with the original query text, and we empirically demonstrate that such cues can substantially degrade DAI-TIR performance. To address this, we propose Diffusion-aware Multi-view Contrastive Learning (DMCL), a hallucination-robust training framework that casts DAI-TIR as joint optimization over representations of the query intent and the target image. DMCL introduces semantic-consistency and diffusion-aware contrastive objectives that align textual and diffusion-generated query views while suppressing hallucinated query signals. The resulting encoder acts as a semantic filter, effectively mapping hallucinated cues into a null space, which improves robustness to spurious cues and better captures the user's intent. Attention visualizations and geometric analyses of the embedding space corroborate this filtering behavior. Across five standard benchmarks, DMCL delivers consistent improvements in multi-round Hits@10, gaining up to 7.37% over prior fine-tuned and zero-shot baselines, indicating that it is a general and robust training framework for DAI-TIR.
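The abstract does not spell out the form of the diffusion-aware contrastive objective. As a rough illustration only, the sketch below shows one *plausible* instantiation in NumPy: an in-batch InfoNCE loss where each diffusion-generated view is down-weighted by its cosine consistency with the text query, so views dominated by hallucinated cues contribute less to the fused query. The function name `dmcl_style_loss`, the consistency-weighting scheme, and the temperature value are all assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize rows to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def dmcl_style_loss(text_emb, gen_emb, target_emb, tau=0.07):
    """Hypothetical multi-view contrastive loss (NOT the paper's exact objective).

    text_emb:   (B, D) text-query embeddings
    gen_emb:    (B, D) embeddings of diffusion-generated query views
    target_emb: (B, D) target-image embeddings
    """
    t = l2_normalize(text_emb)
    g = l2_normalize(gen_emb)
    v = l2_normalize(target_emb)

    # Consistency weight: generated views that disagree with the text query
    # (low or negative cosine similarity) are suppressed.
    w = np.clip((t * g).sum(axis=-1), 0.0, None)  # (B,)

    # Fused query representation: text plus consistency-weighted generated view.
    q = l2_normalize(t + w[:, None] * g)

    # In-batch InfoNCE: the matching target sits on the diagonal.
    logits = q @ v.T / tau                        # (B, B)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

A query batch whose targets align with the fused queries should yield a much lower loss than one with unrelated targets, which is the behavior the weighting is meant to preserve even when the generated views are partly hallucinated.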