VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research lacks systematic benchmarking of multimodal large language models’ (MLLMs) understanding and utilization of visual prompts (VPs), such as bounding boxes. Method: We propose VP-Bench—the first dedicated benchmark for evaluating MLLMs’ VP capabilities—featuring a two-stage evaluation framework: “perception & recognition” and “downstream task impact.” It encompasses 30K VPs spanning eight geometric shapes and 355 attribute combinations, integrating human annotation with automated testing to quantify VP effectiveness in tasks like referring expression comprehension, while analyzing impacts of VP attributes, question ordering, and model scale. Results: Evaluating 28 state-of-the-art MLLMs—including GPT-4o, InternVL3, and Qwen2.5-VL—VP-Bench reveals significant limitations in VP perception robustness, fine-grained localization accuracy, and cross-task generalization. This work establishes a novel evaluation paradigm and actionable insights for advancing VP-driven multimodal reasoning.

📝 Abstract
Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability to interpret visual prompts like bounding boxes
Assessing how visual prompts impact performance on downstream tasks
Providing a benchmark for VP perception and utilization in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces VP-Bench benchmark for visual prompting evaluation
Uses two-stage framework testing perception and task performance
Evaluates 28 MLLMs with diverse shapes and attributes
👥 Authors
Mingjie Xu
The Hong Kong University of Science and Technology (Guangzhou)
Jinpeng Chen
City University of Hong Kong
Continual Learning, Multimodal Large Language Model
Yuzhi Zhao
Ph.D., City University of Hong Kong; B.Eng., Huazhong University of Science and Technology
Low-level Vision, Computational Photography, LLM, MLLM
Jason Chun Lok Li
The University of Hong Kong
Agent, Efficient Neural Networks, Compression, Implicit Neural Representation
Yue Qiu
Huazhong University of Science and Technology
Zekang Du
Huazhong University of Science and Technology
Mengyang Wu
The Chinese University of Hong Kong
MLLM, 3D Vision
Pingping Zhang
City University of Hong Kong
Kun Li
City University of Hong Kong
Hongzheng Yang
The Chinese University of Hong Kong
Wenao Ma
The Chinese University of Hong Kong
Jiaheng Wei
The Hong Kong University of Science and Technology (Guangzhou)
Qinbin Li
Professor, Computer Science, Huazhong University of Science and Technology
Machine Learning System, Data Science, Federated Learning
Kangcheng Liu
Hunan University
Wenqiang Lei
Sichuan University