🤖 AI Summary
Existing video captioning models excel at generating descriptive text but follow user instructions poorly, and no dedicated benchmark exists to evaluate this capability. Method: We introduce IF-VidCap, the first benchmark focused on instruction-following in video captioning. It comprises 1,400 high-quality samples and systematically quantifies controllable generation along two dimensions: format correctness and content correctness. We design a specialized evaluation framework that integrates multi-dimensional human and automated assessments, and use it to comprehensively test over 20 state-of-the-art multimodal large language models, covering both proprietary and open-source variants. Results: Our evaluation reveals that top-tier open-source models now approach proprietary models in instruction-following fidelity; moreover, general-purpose multimodal foundation models significantly outperform specialized dense-captioning models on complex instructions. This work fills a critical gap in controllable video captioning evaluation and advances the joint optimization of descriptive richness and instruction fidelity.
📝 Abstract
Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions along two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
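To make the two-dimensional scoring concrete, below is a minimal sketch of how a caption could be scored for format correctness (programmatic checks) and content correctness (an LLM judge). This is not the official IF-VidCap implementation: the rule set, function names, judge interface, and equal weighting are all illustrative assumptions.

```python
"""Sketch of a two-axis caption scorer in the spirit of IF-VidCap.

All names, rules, and the judge prompt are hypothetical; the
benchmark's actual checks and rubric may differ.
"""
import json
import re


def format_score(caption: str, rules: dict) -> float:
    """Fraction of programmatically verifiable format rules satisfied.

    `rules` is a hypothetical per-sample spec, e.g.
    {"json": True, "max_words": 80, "must_contain": ["camera"]}.
    """
    checks = []
    if rules.get("json"):
        # Instruction required valid JSON output.
        try:
            json.loads(caption)
            checks.append(True)
        except json.JSONDecodeError:
            checks.append(False)
    if "max_words" in rules:
        checks.append(len(caption.split()) <= rules["max_words"])
    for phrase in rules.get("must_contain", []):
        # Case-insensitive literal match for required phrases.
        checks.append(re.search(re.escape(phrase), caption, re.I) is not None)
    return sum(checks) / len(checks) if checks else 1.0


def content_score(caption: str, reference: str, judge) -> float:
    """Content correctness via an LLM judge (assumed interface).

    `judge` is any callable mapping a prompt string to a score in [0, 1].
    """
    prompt = (
        "Reference description:\n" + reference
        + "\n\nCandidate caption:\n" + caption
        + "\n\nRate factual consistency from 0 to 1."
    )
    return judge(prompt)


def overall_score(caption: str, rules: dict, reference: str, judge) -> float:
    # Equal weighting of the two axes is an assumption for illustration.
    return 0.5 * format_score(caption, rules) + 0.5 * content_score(
        caption, reference, judge
    )
```

In practice, the format axis rewards constraints that can be checked deterministically (structure, length, required fields), while the content axis requires a semantic comparison against the video, which is why the benchmark pairs automated checks with human and model-based assessment.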