🤖 AI Summary
Existing research lacks systematic benchmarking of multimodal large language models’ (MLLMs) understanding and utilization of visual prompts (VPs), such as bounding boxes.
Method: We propose VP-Bench, the first dedicated benchmark for evaluating MLLMs' VP capabilities. It features a two-stage evaluation framework ("perception & recognition" and "downstream task impact") and encompasses 30K VPs spanning eight geometric shapes and 355 attribute combinations, combining human annotation with automated testing to quantify VP effectiveness in tasks such as referring expression comprehension, while analyzing the impact of VP attributes, question ordering, and model scale.
Results: Evaluating 28 state-of-the-art MLLMs, including GPT-4o, InternVL3, and Qwen2.5-VL, VP-Bench reveals significant limitations in VP perception robustness, fine-grained localization accuracy, and cross-task generalization. This work establishes a novel evaluation paradigm and offers actionable insights for advancing VP-driven multimodal reasoning.
📝 Abstract
Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, as a reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.
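To make the core idea concrete, here is a minimal, hypothetical sketch (not the paper's code) of how a bounding-box visual prompt might be rendered onto an image before it is shown to an MLLM. The image is modeled as a plain 2D grid of pixel values; the box coordinates, overlay value, and line width stand in for the kinds of VP attributes (shape, color, thickness) that the benchmark varies.

```python
# Hypothetical illustration of a visual prompt (VP): overlay a
# rectangle outline on a copy of an image represented as a 2D grid.
def draw_box_prompt(image, box, value=1, width=1):
    """Return a copy of `image` with the outline of `box` = (x0, y0, x1, y1)
    drawn using `value`; `width` controls line thickness in pixels."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # do not mutate the original image
    x0, y0, x1, y1 = box
    for y in range(max(y0, 0), min(y1 + 1, h)):
        for x in range(max(x0, 0), min(x1 + 1, w)):
            # A pixel is on the outline if it lies within `width`
            # of any of the four box edges.
            on_edge = (x - x0 < width or x1 - x < width or
                       y - y0 < width or y1 - y < width)
            if on_edge:
                out[y][x] = value

    return out

img = [[0] * 8 for _ in range(6)]            # blank 8x6 "image"
prompted = draw_box_prompt(img, (1, 1, 6, 4))
for row in prompted:
    print("".join("#" if p else "." for p in row))
```

A real pipeline would draw on RGB images (e.g., with an image library) and could swap the rectangle for any of the eight shapes the benchmark covers; the structure of the operation stays the same.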