V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

📅 2025-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video understanding benchmarks rely solely on textual prompts, lacking spatiotemporal precision and thus failing to support natural human–machine interaction. To address this, we introduce V2P-Bench—the first vision-prompted video-language understanding benchmark—comprising 980 videos and 1,172 question-answer pairs, covering five task categories and twelve fine-grained evaluation dimensions. Crucially, it pioneers vision-based prompting (rather than text-only input) to enable explicit spatiotemporal alignment. The benchmark is built upon rigorously human-annotated, structured data that jointly evaluates spatiotemporal localization, event reasoning, and cross-frame association. Experiments reveal substantial performance gaps: GPT-4o and Gemini-1.5-Pro achieve only 65.4% and 67.9% accuracy, markedly below human experts (88.3%), exposing a fundamental limitation of current large vision-language models in interpreting visual prompts. V2P-Bench thus shifts the evaluation paradigm from language-centric to human-interaction-centric assessment.

📝 Abstract
Large Vision-Language Models (LVLMs) have recently made significant progress in video understanding. However, current benchmarks uniformly rely on text prompts for evaluation, which often necessitate complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: https://github.com/gaotiexinqu/V2P-Bench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LVLMs' video understanding with visual prompts
Addressing text prompts' spatial-temporal reference limitations
Benchmarking multimodal human-model interaction in video tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces V2P-Bench for video visual prompts
Evaluates LVLMs with 980 videos and 1,172 QAs
Highlights LVLMs' poor performance on visual prompts
Yiming Zhao
University of Science and Technology of China
Yu Zeng
University of Science and Technology of China
Yukun Qi
University of Science and Technology of China
YaoYang Liu
HKUST
Lin Chen
University of Science and Technology of China
Zehui Chen
University of Science and Technology of China
Xikun Bao
University of Science and Technology of China
Jie Zhao
Feng Zhao
University of Science and Technology of China