🤖 AI Summary
Multimodal Reference Visual Grounding (MRVG) faces significant challenges in distinguishing visually similar objects (e.g., Diet Coke vs. regular Coke) due to fine-grained semantic ambiguity across modalities.
Method: The paper formally defines MRVG as the task of localizing a target object in a query image, given a textual description and a set of multimodal reference images from a database. It proposes MRVG-Net, a novel framework that integrates few-shot object detection with large language model (LLM)-driven cross-image semantic matching to bridge detection and language-guided grounding.
Contribution/Results: We introduce the first dedicated MRVG benchmark dataset and demonstrate that MRVG-Net significantly outperforms state-of-the-art multimodal foundation models—including Qwen2.5-VL-7B—on fine-grained similar-object localization. Key contributions include: (1) a formal task definition for MRVG; (2) a new multimodal fusion architecture; (3) the first task-specific benchmark; and (4) an LLM-enhanced cross-image semantic matching mechanism.
📝 Abstract
Visual grounding focuses on detecting objects in images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models on large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate Diet Coke from regular Coke in an image. In this case, additional reference images of Diet Coke and regular Coke can help ground the similar objects. In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Based on these reference images and a language expression, the model is required to detect a target object in a query image. We first introduce a new dataset for studying the MRVG problem. We then introduce a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images with few-shot object detection and by using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to state-of-the-art LVLMs such as Qwen2.5-VL-7B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding. Project page with our code and dataset: https://irvlutd.github.io/MultiGrounding
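To make the task interface concrete, the following toy sketch illustrates the selection step the abstract describes: candidate detections in the query image are scored against reference objects, and the language expression picks which reference object to ground. All names (`Candidate`, `ground`) and the hard-coded similarity scores are hypothetical stand-ins for the paper's few-shot detector and LLM-based matcher, not the actual MRVG-Net implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    """A detected region in the query image."""
    box: tuple                      # (x, y, w, h) in query-image pixels
    ref_scores: dict = field(default_factory=dict)  # similarity to each reference object

def ground(candidates, target_name):
    """Return the candidate best matching the reference object named
    in the language expression (None if there are no candidates)."""
    best = None
    for cand in candidates:
        score = cand.ref_scores.get(target_name, 0.0)
        if best is None or score > best[0]:
            best = (score, cand)
    return best[1] if best else None

# Toy example: two visually similar cans, disambiguated by reference matching.
# In the real pipeline these scores would come from comparing each detection
# against the database's reference images.
cands = [
    Candidate((10, 10, 40, 80), {"diet_coke": 0.91, "coke": 0.40}),
    Candidate((60, 10, 40, 80), {"diet_coke": 0.35, "coke": 0.88}),
]
print(ground(cands, "diet_coke").box)  # → (10, 10, 40, 80)
```

The key design point mirrored here is that grounding reduces to matching detections against reference entries rather than asking a single model to tell the two cans apart from pixels alone.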