IF-VidCap: Can Video Caption Models Follow Instructions?

📅 2025-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video captioning models excel at generating descriptive text but follow user instructions poorly, and no dedicated benchmark exists to evaluate this capability. Method: We introduce IF-VidCap, the first instruction-following-focused video captioning benchmark, comprising 1,400 high-quality samples and systematically quantifying controllable generation performance along two dimensions: format correctness and content correctness. We design a specialized evaluation framework that integrates multi-dimensional human and automated assessments to comprehensively test over 20 state-of-the-art multimodal large language models, both proprietary and open-source. Results: Our evaluation reveals that top-tier open-source models now approach proprietary models in instruction-following fidelity; moreover, general-purpose multimodal foundation models significantly outperform specialized dense-captioning models on complex instructions. This work fills a critical gap in controllable video captioning evaluation and advances the joint optimization of descriptive richness and instruction fidelity.

📝 Abstract
Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
Problem

Research questions and friction points this paper is trying to address.

Evaluating instruction-following capabilities in video captioning models
Assessing format and content correctness in controllable video descriptions
Benchmarking models on specific user instructions rather than general descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces IF-VidCap benchmark for controllable video captioning
Systematically evaluates format and content correctness dimensions (see the sketch after this list)
Reveals open-source models achieving near-parity with proprietary ones
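
To make the two scoring dimensions concrete, here is a minimal illustrative sketch of how such an evaluation harness could be wired: rule-based checks gate format correctness, while content correctness is delegated to a pluggable judge (an LLM grader or a human rater). The constraint schema (`spec`), the function names, and the stand-in lambda judge are assumptions made for this sketch, not the paper's actual implementation.

```python
import json
import re

def check_format(caption: str, spec: dict) -> bool:
    """Rule-based format check: does the caption satisfy the structural
    constraints stated in the instruction? `spec` is a hypothetical
    constraint schema invented for this sketch."""
    lines = [ln for ln in caption.strip().splitlines() if ln.strip()]
    if "bullet_count" in spec and len(lines) != spec["bullet_count"]:
        return False
    if "max_words" in spec and sum(len(ln.split()) for ln in lines) > spec["max_words"]:
        return False
    if spec.get("must_be_json"):
        try:
            json.loads(caption)
        except json.JSONDecodeError:
            return False
    if "line_prefix" in spec and not all(re.match(spec["line_prefix"], ln) for ln in lines):
        return False
    return True

def score_caption(caption: str, spec: dict, content_judge) -> dict:
    """Combine the two dimensions: format correctness is a binary,
    rule-based gate; content correctness is delegated to a judge
    callable (e.g., an LLM grader or a human) returning a score in [0, 1]."""
    return {
        "format_ok": check_format(caption, spec),
        "content_score": content_judge(caption),
    }

if __name__ == "__main__":
    spec = {"bullet_count": 3, "line_prefix": r"- "}
    caption = (
        "- A chef dices onions on a wooden board.\n"
        "- Oil sizzles as vegetables hit the wok.\n"
        "- Steam rises while the dish is plated."
    )
    # Stand-in judge: a real pipeline would query an LLM or a human rater.
    print(score_caption(caption, spec, content_judge=lambda c: 0.9))
```

Separating the binary format gate from the graded content score mirrors the benchmark's premise: a caption can be fluent and factually accurate yet still violate the user's formatting constraints.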
👥 Authors
Shihao Li
Nanjing University
Yuanxing Zhang
Kuaishou Technology
Recommender System · Large Language Model · Video Understanding
Jiangtao Wu
PhD Student of Solid Mechanics, Georgia Institute of Technology
Solid mechanics · 3D printing · Shape memory polymer · Molecular dynamics · Density functional theory
Zhide Lei
Nanjing University
Yiwen He
Nanjing University
Runzhe Wen
Nanjing University
Chenxi Liao
Nanjing University
Chengkang Jiang
Nanjing University
An Ping
Nanjing University
Shuo Gao
Beihang University, University of Cambridge (Ph.D.)
AI for Healthcare · Wearable Systems · Human Body Digital Twins · Neural Computing
Suhan Wang
Nanjing University
Zhaozhou Bian
Nanjing University
Zijun Zhou
Shanghai University
Jingyi Xie
Assistant Professor, San José State University
Human-Computer Interaction · Accessibility · Human-Centered AI
Jiayi Zhou
Nanjing University
Jing Wang
Nanjing University
Yifan Yao
Drexel University
Weihao Xie
Nanjing University
Yingshui Tan
M-A-P
Yanghai Wang
Nanjing University
Qianqian Xie
Wuhan University
NLP · LLM
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Computer Vision · Pattern Recognition · Biologically-inspired Learning
Jiaheng Liu
Nanjing University