🤖 AI Summary
Existing benchmarks for video caption evaluation lack fine-grained spatiotemporal coverage, which hinders precise optimization of text-to-video generation models. To address this, we introduce VCapsBench, the first large-scale, fine-grained benchmark for video caption evaluation, comprising 5,677 videos and 109,796 question-answer pairs annotated across 21 spatiotemporal dimensions relevant to caption quality. We propose three metrics: Accuracy (AR), Inconsistency Rate (IR), and Coverage Rate (CR). We also design an LLM-based pipeline that assesses captions automatically through contrastive question-answer analysis. The benchmark combines fine-grained annotation with scalable automated analysis and yields actionable feedback for improving captions and, in turn, the semantic coherence and visual fidelity of text-to-video models.
📝 Abstract
Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing the spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark, comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically shown to be critical for text-to-video generation. We further introduce three metrics, Accuracy (AR), Inconsistency Rate (IR), and Coverage Rate (CR), and an automated evaluation pipeline that leverages a large language model (LLM) to verify caption quality via contrastive QA-pair analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and code are available at https://github.com/GXYM/VCapsBench.
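The abstract names the three metrics but does not define them. As a rough illustration only, the sketch below shows one plausible way such rates could be computed from per-question judgments. Everything in it is an assumption rather than the benchmark's actual definition: the `score_caption` function, the `"n/a"` label for questions the caption does not cover, and the choice to normalize all three rates by the total number of questions are hypothetical.

```python
from collections import Counter

# Hypothetical label for "the caption gives no basis to answer this question".
UNANSWERABLE = "n/a"


def score_caption(judgments):
    """Compute assumed AR / IR / CR rates for one caption.

    judgments: list of (expected, predicted) pairs, where `expected` is the
    annotated answer and `predicted` is the answer an LLM derives from the
    caption alone (or UNANSWERABLE if the caption does not cover it).
    """
    if not judgments:
        raise ValueError("at least one QA judgment is required")

    counts = Counter()
    for expected, predicted in judgments:
        if predicted == UNANSWERABLE:
            counts["uncovered"] += 1      # caption is silent on this question
        elif predicted == expected:
            counts["correct"] += 1        # caption agrees with the annotation
        else:
            counts["inconsistent"] += 1   # caption contradicts the annotation

    total = len(judgments)
    return {
        # Assumed definitions, all normalized by the total question count.
        "AR": counts["correct"] / total,
        "IR": counts["inconsistent"] / total,
        "CR": (counts["correct"] + counts["inconsistent"]) / total,
    }


example = [("yes", "yes"), ("no", "yes"), ("yes", "n/a"), ("no", "no")]
scores = score_caption(example)
print(scores)
```

Under these assumed definitions, AR and IR sum to CR, so a caption can raise its coverage only by answering more questions, whether correctly or not. The definitions in the paper and repository may differ.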