VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation

📅 2025-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video caption evaluation benchmarks lack fine-grained coverage of spatial and temporal detail, which makes it hard to optimize captions for text-to-video generation models. To address this, we introduce VCapsBench, the first large-scale, fine-grained benchmark for video caption evaluation, comprising 5,677 videos and 109,796 question-answer pairs. The QA pairs are annotated across 21 fine-grained dimensions that affect caption quality, such as camera movement and shot type. We propose three metrics: Accuracy (AR), Inconsistency Rate (IR), and Coverage Rate (CR). We also design an LLM-based pipeline that assesses captions automatically through contrastive question answering. The framework combines fine-grained annotation with scalable automated analysis, and it is intended to give actionable feedback for improving both the semantic coherence and the visual fidelity of text-to-video models.

📝 Abstract

Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics, Accuracy (AR), Inconsistency Rate (IR), and Coverage Rate (CR), and an automated evaluation pipeline that leverages a large language model (LLM) to verify caption quality via contrastive QA-pair analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and code are available at https://github.com/GXYM/VCapsBench.
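The abstract names the three metrics but does not give their formulas. As a rough illustration only, the sketch below assumes each question gets one of three verdicts after an LLM reads the caption: the caption supports the reference answer, contradicts it, or does not contain the information. The verdict labels, the function name `caption_scores`, and the choice of denominator (all questions asked) are assumptions made for this example, not definitions taken from the paper.

```python
from collections import Counter

# Hypothetical verdict labels for one question judged against one caption.
CORRECT = "correct"            # caption agrees with the reference answer
WRONG = "wrong"                # caption contradicts the reference answer
UNANSWERABLE = "unanswerable"  # caption does not contain the information


def caption_scores(verdicts):
    """Return (AR, IR, CR) as fractions of all questions asked.

    Assumed readings (not confirmed by the source):
      AR = share of questions the caption answers correctly
      IR = share of questions the caption answers incorrectly
      CR = share of questions the caption addresses at all
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    total = len(verdicts)
    ar = counts[CORRECT] / total
    ir = counts[WRONG] / total
    cr = (counts[CORRECT] + counts[WRONG]) / total
    return ar, ir, cr


if __name__ == "__main__":
    ar, ir, cr = caption_scores([CORRECT, CORRECT, WRONG, UNANSWERABLE])
    print(f"AR={ar:.2f} IR={ir:.2f} CR={cr:.2f}")  # AR=0.50 IR=0.25 CR=0.75
```

Under these assumptions AR + IR = CR, so a caption can raise its coverage either by adding correct detail or by adding wrong detail, which is why reporting the three numbers together is more informative than any one alone.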

Problem

Research questions and friction points this paper is trying to address.

Lack of fine-grained benchmarks for video caption evaluation

Inadequate assessment of spatial-temporal details in captions

Need for actionable metrics to optimize caption quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale fine-grained video caption benchmark

Systematic annotation across 21 dimensions

Automated evaluation pipeline using LLM
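To make the combination of per-dimension annotation and automated judging concrete, here is a minimal sketch of how scores could be broken down by dimension. It is a simplification, not the paper's contrastive pipeline: the `QAPair` type, the `accuracy_by_dimension` function, and the `toy_judge` stand-in (which replaces a real LLM call with a substring check) are all hypothetical names introduced for this example.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple, Optional


class QAPair(NamedTuple):
    dimension: str  # e.g. "camera movement"
    question: str
    answer: str     # annotated reference answer


# A judge reads (caption, question) and returns an answer,
# or None when the caption lacks the needed information.
Judge = Callable[[str, str], Optional[str]]


def accuracy_by_dimension(caption: str, qa_pairs: Iterable[QAPair], judge: Judge):
    """Fraction of questions per dimension whose judged answer matches the reference."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for qa in qa_pairs:
        totals[qa.dimension] += 1
        predicted = judge(caption, qa.question)
        if predicted is not None and predicted.strip().lower() == qa.answer.strip().lower():
            hits[qa.dimension] += 1
    return {dim: hits[dim] / totals[dim] for dim in totals}


def toy_judge(caption: str, question: str) -> Optional[str]:
    """Stand-in for an LLM call: answers 'yes' only if the quoted term appears in the caption."""
    term = question.split("'")[1]
    return "yes" if term in caption.lower() else None


if __name__ == "__main__":
    pairs = [
        QAPair("camera movement", "Is the camera movement a 'pan'?", "yes"),
        QAPair("shot type", "Is the shot type a 'close-up'?", "yes"),
    ]
    print(accuracy_by_dimension("The camera pans left across a beach.", pairs, toy_judge))
```

Keeping the judge as a plain callable means the same aggregation code works whether the answers come from a toy rule, a local model, or a hosted LLM. A per-dimension breakdown like this shows which kinds of detail a captioner tends to omit.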
