🤖 AI Summary
Existing vision-language models (VLMs) struggle to accurately retrieve and reference relevant images from dialogue history in multi-turn conversations, which limits their multimodal understanding and interactive capabilities. To address this, we formally introduce the *Contextual Image Reference* task and present RefDialog, a high-quality, human-annotated benchmark dataset for evaluating this capability, accompanied by fine-grained evaluation metrics. We propose ImageRef-VL, an instruction-tuning framework that reformulates image reference as a structured instruction-following problem. Extensive experiments demonstrate that open-source VLMs fine-tuned with ImageRef-VL achieve an 88% improvement over the strongest open-source baselines on RefDialog and outperform leading proprietary models, significantly advancing visual grounding in dynamic, conversational settings.
📝 Abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieved documents based on conversation context -- and systematically investigate VLMs' capability in this aspect. We conduct the first evaluation of contextual image referencing, comprising a dedicated test dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities through instruction fine-tuning on a large-scale, manually curated multimodal conversation dataset. Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs in contextual image referencing tasks. Our code is available at https://github.com/bytedance/ImageRef-VL.
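The abstract does not spell out how image references are represented or scored. A minimal sketch of one plausible setup is below: responses cite retrieved images via placeholder tags, and a set-level F1 compares cited images against a gold reference. The `<image_N>` tag convention, the function names, and the F1 formulation are all illustrative assumptions, not the metrics actually defined in the paper.

```python
import re


def extract_image_refs(response: str) -> set[str]:
    """Collect image placeholder tags (e.g. '<image_2>') from a response.

    The '<image_N>' tag format is a hypothetical convention for marking
    where a retrieved image is referenced, not ImageRef-VL's actual format.
    """
    return set(re.findall(r"<image_\d+>", response))


def reference_f1(predicted: str, gold: str) -> float:
    """Set-level F1 between predicted and gold image references.

    One plausible (assumed) flavor of a fine-grained reference metric:
    precision/recall over which images the model chose to cite.
    """
    pred, ref = extract_image_refs(predicted), extract_image_refs(gold)
    if not pred and not ref:
        return 1.0  # neither cites any image: vacuously correct
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)


# Example: the model cites images 1 and 3; the gold answer cites 1 and 2.
pred = "Night views are shown here: <image_1> and here: <image_3>."
gold = "See <image_1> and <image_2> for night views."
print(reference_f1(pred, gold))  # 0.5
```

A span-level variant (checking *where* in the response each image is placed, not just *which* images appear) would be a natural refinement of the same idea.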