🤖 AI Summary
Existing research lacks systematic benchmarking of multimodal large language models’ (MLLMs) understanding and utilization of visual prompts (VPs), such as bounding boxes.
Method: We propose VP-Bench, the first dedicated benchmark for evaluating MLLMs' VP capabilities. It features a two-stage evaluation framework ("perception & recognition" and "downstream task impact") and encompasses 30K VPs spanning eight geometric shapes and 355 attribute combinations, combining human annotation with automated testing to quantify VP effectiveness in tasks such as referring expression comprehension, while analyzing the impact of VP attributes, question ordering, and model scale.
Results: Evaluating 28 state-of-the-art MLLMs, including GPT-4o, InternVL3, and Qwen2.5-VL, VP-Bench reveals significant limitations in VP perception robustness, fine-grained localization accuracy, and cross-task generalization. This work establishes a novel evaluation paradigm and offers actionable insights for advancing VP-driven multimodal reasoning.
📝 Abstract
Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, as a reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.
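To make the core idea concrete, here is a minimal, hypothetical sketch (not the paper's code) of how a bounding-box visual prompt might be rendered onto an image before it is shown to an MLLM. The image is modeled as a plain 2D grid of pixel values; the box coordinates, overlay value, and line width stand in for the kinds of VP attributes (shape, color, thickness) that the benchmark varies.

```python
# Hypothetical illustration of a visual prompt (VP): overlay a
# rectangle outline on a copy of an image represented as a 2D grid.
def draw_box_prompt(image, box, value=1, width=1):
    """Return a copy of `image` with the outline of `box` = (x0, y0, x1, y1)
    drawn using `value`; `width` controls line thickness in pixels."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # do not mutate the original image
    x0, y0, x1, y1 = box
    for y in range(max(y0, 0), min(y1 + 1, h)):
        for x in range(max(x0, 0), min(x1 + 1, w)):
            # A pixel is on the outline if it lies within `width`
            # of any of the four box edges.
            on_edge = (x - x0 < width or x1 - x < width or
                       y - y0 < width or y1 - y < width)
            if on_edge:
                out[y][x] = value

    return out

img = [[0] * 8 for _ in range(6)]            # blank 8x6 "image"
prompted = draw_box_prompt(img, (1, 1, 6, 4))
for row in prompted:
    print("".join("#" if p else "." for p in row))
```

A real pipeline would draw on RGB images (e.g., with an image library) and could swap the rectangle for any of the eight shapes the benchmark covers; the structure of the operation stays the same.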