🤖 AI Summary
Existing video captioning models excel at generating descriptive text but follow user instructions poorly, and no dedicated benchmark exists to evaluate this capability. Method: We introduce IF-VidCap, the first benchmark focused on instruction-following in video captioning. It comprises 1,400 high-quality samples and systematically quantifies controllable generation along two dimensions: format correctness and content correctness. We design a specialized evaluation framework that integrates multi-dimensional human and automated assessments, and use it to comprehensively test over 20 state-of-the-art multimodal large language models, covering both proprietary and open-source variants. Results: Our evaluation reveals that top-tier open-source models now approach proprietary models in instruction-following fidelity; moreover, general-purpose multimodal foundation models significantly outperform specialized dense-captioning models on complex instructions. This work fills a critical gap in controllable video captioning evaluation and advances the joint optimization of descriptive richness and instruction fidelity.
📝 Abstract
Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions along two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
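To make the two-dimensional scoring concrete, below is a minimal sketch of how a caption could be scored for format correctness (programmatic checks) and content correctness (an LLM judge). This is not the official IF-VidCap implementation: the rule set, function names, judge interface, and equal weighting are all illustrative assumptions.

```python
"""Sketch of a two-axis caption scorer in the spirit of IF-VidCap.

All names, rules, and the judge prompt are hypothetical; the
benchmark's actual checks and rubric may differ.
"""
import json
import re


def format_score(caption: str, rules: dict) -> float:
    """Fraction of programmatically verifiable format rules satisfied.

    `rules` is a hypothetical per-sample spec, e.g.
    {"json": True, "max_words": 80, "must_contain": ["camera"]}.
    """
    checks = []
    if rules.get("json"):
        # Instruction required valid JSON output.
        try:
            json.loads(caption)
            checks.append(True)
        except json.JSONDecodeError:
            checks.append(False)
    if "max_words" in rules:
        checks.append(len(caption.split()) <= rules["max_words"])
    for phrase in rules.get("must_contain", []):
        # Case-insensitive literal match for required phrases.
        checks.append(re.search(re.escape(phrase), caption, re.I) is not None)
    return sum(checks) / len(checks) if checks else 1.0


def content_score(caption: str, reference: str, judge) -> float:
    """Content correctness via an LLM judge (assumed interface).

    `judge` is any callable mapping a prompt string to a score in [0, 1].
    """
    prompt = (
        "Reference description:\n" + reference
        + "\n\nCandidate caption:\n" + caption
        + "\n\nRate factual consistency from 0 to 1."
    )
    return judge(prompt)


def overall_score(caption: str, rules: dict, reference: str, judge) -> float:
    # Equal weighting of the two axes is an assumption for illustration.
    return 0.5 * format_score(caption, rules) + 0.5 * content_score(
        caption, reference, judge
    )
```

In practice, the format axis rewards constraints that can be checked deterministically (structure, length, required fields), while the content axis requires a semantic comparison against the video, which is why the benchmark pairs automated checks with human and model-based assessment.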