ProactiveBench: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models

📅 2025-07-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Research on video large language models (VLLMs) currently lacks a comprehensive benchmark for evaluating proactive interaction capabilities—specifically, a model's ability to autonomously determine *when* to respond during video playback. Method: We introduce ProactiveBench, the first benchmark dedicated to assessing VLLM proactive interaction, built upon a time-aware evaluation framework and human-annotated multi-scenario video-dialogue data. Its core innovation is the Proactive Area Under Curve (PAUC) metric, which quantifies response timing dynamics—capturing human preferences for natural, human-like interaction. Benchmark validity is confirmed via user studies. Contribution/Results: Experiments demonstrate that PAUC achieves significantly higher correlation with human preference than conventional text-matching metrics (e.g., BLEU, BERTScore). ProactiveBench thus establishes a reproducible, user-aligned standard for evaluating proactive interaction in VLLMs, enabling principled assessment of temporal decision-making in video-grounded dialogue.
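The summary does not give PAUC's exact formula, only that it is an area-under-curve measure over response timing. As a rough illustration of that idea only, the sketch below integrates a recall-over-time step curve: how many ground-truth response points have been matched by a model response as video playback progresses. The function name `pauc`, the fixed matching tolerance, and the simple proximity matching are all assumptions for illustration, not the paper's actual definition.

```python
def pauc(gt_times, pred_times, duration, tol=1.0):
    """Hypothetical sketch of a timing-aware AUC (NOT the paper's formula).

    gt_times:   timestamps (seconds) where a response *should* occur
    pred_times: timestamps where the model actually responded
    duration:   total video length in seconds
    tol:        a response within `tol` seconds of a ground-truth
                point counts as a match (assumed matching rule)
    """
    if not gt_times:
        return 0.0
    preds = sorted(pred_times)
    # Mark each ground-truth point as matched if any prediction is close.
    matched = [(g, any(abs(p - g) <= tol for p in preds))
               for g in sorted(gt_times)]
    # Integrate the recall-over-time step curve across the video.
    area, recall, prev_t, hits = 0.0, 0.0, 0.0, 0
    for g, hit in matched:
        area += recall * (g - prev_t)   # recall held since last event
        hits += int(hit)
        recall = hits / len(matched)
        prev_t = g
    area += recall * (duration - prev_t)
    return area / duration              # normalize to [0, 1]
```

Under this toy definition, a model that responds close to the right moments scores higher than one that responds late or not at all, which is the qualitative behavior the summary attributes to PAUC.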

📝 Abstract
With the growing research focus on multimodal dialogue systems, the capability for proactive interaction is gradually gaining recognition. As an alternative to conventional turn-by-turn dialogue, users increasingly expect multimodal systems to take more initiative, for example, by autonomously determining the timing of multi-turn responses in real time during video playback. To facilitate progress in this emerging area, we introduce ProactiveBench, the first comprehensive benchmark to evaluate a system's ability to engage in proactive interaction. Since model responses are generated at varying timestamps, we further propose PAUC, the first metric that accounts for the temporal dynamics of model responses. This enables a more accurate evaluation of systems operating in proactive settings. Through extensive benchmarking of various baseline systems on ProactiveBench and a user study of human preferences, we show that PAUC is in better agreement with human preferences than traditional evaluation metrics, which typically only consider the textual content of responses. These findings demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios. Project homepage: https://github.com/yellow-binary-tree/ProactiveBench
Problem

Research questions and friction points this paper is trying to address.

Evaluate proactive interaction in video language models
Develop metric for temporal dynamics of responses
Align system evaluation with human preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

ProactiveBench evaluates proactive video interactions
PAUC metric assesses response timing dynamics
PAUC aligns better with human preferences