ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models

📅 2025-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) struggle to accurately retrieve and reference relevant images from dialogue history in multi-turn conversations, which limits their multimodal understanding and interactive capabilities. To address this, the paper formally introduces the *Contextual Image Reference* task, the ability to reference relevant images from retrieved documents based on conversation context, and conducts the first evaluation of this capability, pairing a dedicated, human-annotated test dataset with fine-grained evaluation metrics. The authors propose ImageRef-VL, a method that reformulates image referencing as a structured instruction-following problem and fine-tunes open-source VLMs on a large-scale, manually curated multimodal conversation dataset. Experiments show that VLMs fine-tuned with ImageRef-VL achieve an 88% improvement over state-of-the-art open-source baselines and also outperform proprietary models, significantly advancing visual grounding in conversational settings.
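The summary mentions fine-grained metrics over which images a response references. As a minimal sketch of one plausible such metric, the snippet below computes set-level F1 over referenced image identifiers; the function name and scoring scheme are assumptions for illustration, not the exact metrics defined in the paper.

```python
def image_ref_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 over the set of image identifiers a response references.

    A plausible fine-grained metric for contextual image referencing;
    the paper's actual evaluation metrics may differ.
    """
    if not predicted and not gold:
        return 1.0  # nothing to reference, nothing referenced: perfect
    tp = len(predicted & gold)  # correctly referenced images
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: the model references image_1 and image_3, but the human
# annotation expects image_1 and image_2.
print(image_ref_f1({"image_1", "image_3"}, {"image_1", "image_2"}))  # 0.5
```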

📝 Abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context -- and systematically investigate VLMs' capability in this aspect. We conduct the first evaluation for contextual image referencing, comprising a dedicated testing dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities through instruction fine-tuning on a large-scale, manually curated multimodal conversation dataset. Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs in contextual image referencing tasks. Our code is available at https://github.com/bytedance/ImageRef-VL.
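To make the task concrete, here is a minimal, hypothetical sketch of what an instruction-tuning sample for contextual image referencing could look like, assuming the common convention of replacing each retrieved image with a placeholder token that the model may reproduce in its answer. The field names and <image_k> tokens are illustrative assumptions, not the format used in the ImageRef-VL repository.

```python
# Hypothetical training sample for contextual image referencing.
# Field names and <image_k> placeholder tokens are illustrative only.
sample = {
    # Retrieved document text, with each candidate image replaced by a token.
    "context": (
        "Mount Fuji is Japan's tallest peak. <image_1>\n"
        "The official climbing season runs from July to September. <image_2>"
    ),
    "question": "What does Mount Fuji look like, and when can I climb it?",
    # Target response: the model must decide which image tokens to keep
    # and where to place them in the generated answer.
    "response": (
        "Mount Fuji is a snow-capped volcanic cone <image_1>. "
        "The official climbing season is July through September."
    ),
}
```

Fine-tuning on pairs like this would teach the model both whether a retrieved image is worth referencing and where to place the reference in the generated response.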
Problem

Research questions and friction points this paper is trying to address.

Visual Language Models
Chatbots
Image Retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

ImageRef-VL
Visual Language Model
Dialogue-Context Image Retrieval
👥 Authors
Jingwei Yi, University of Science and Technology of China (LLM Safety, Federated Learning)
Junhao Yin, ByteDance
Ju Xu, ByteDance
Peng Bao, Peking University
Yongliang Wang, Riemann Lab, Huawei Technologies (Positioning, Navigation, 3D Reconstruction, Spatial Computing, Autonomous Driving)
Wei Fan, University of Oxford
Hao Wang, Tsinghua University