VC4VG: Optimizing Video Captions for Text-to-Video Generation

📅 2025-10-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current text-to-video (T2V) training predominantly relies on generic video captions, lacking semantic adaptation tailored to generative tasks—leading to poor temporal coherence and weak instruction alignment in generated videos. To address this, we propose VC4VG, the first caption optimization framework explicitly designed for T2V generation. VC4VG introduces a fine-grained, multi-dimensional semantic decomposition schema—covering subjects, actions, scenes, and temporal dynamics—grounded in the intrinsic semantic requirements of T2V models. We further formulate a necessity-based caption refinement principle and establish VC4VG-Bench, a dedicated evaluation benchmark. Closed-loop fine-tuning experiments across multiple T2V models demonstrate that optimized captions significantly improve generation quality: action coherence increases by 23.6%, and instruction-following accuracy rises by 18.4%. All code, annotation tools, and the benchmark are publicly released.

Technology Category

Application Category

📝 Abstract

Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models.We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements.Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.

Problem

Research questions and friction points this paper is trying to address.

Optimizing video captions for text-to-video generation training

Developing comprehensive caption framework for video reconstruction needs

Establishing benchmark with T2V-specific evaluation metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework optimizes video captions for generation models

Benchmark introduces multi-dimensional necessity-graded metrics

Methodology decomposes caption elements for video reconstruction

🔎 Similar Papers

No similar papers found.

Authors to Follow