🤖 AI Summary
Current text-to-video (T2V) training predominantly relies on generic video captions that lack semantic adaptation for generative tasks, leading to poor temporal coherence and weak instruction alignment in generated videos. To address this, we propose VC4VG, the first caption optimization framework explicitly designed for T2V generation. VC4VG introduces a fine-grained, multi-dimensional semantic decomposition schema, covering subjects, actions, scenes, and temporal dynamics, grounded in the intrinsic semantic requirements of T2V models. We further formulate a necessity-based caption refinement principle and establish VC4VG-Bench, a dedicated evaluation benchmark. Closed-loop fine-tuning experiments across multiple T2V models demonstrate that optimized captions significantly improve generation quality: action coherence increases by 23.6%, and instruction-following accuracy rises by 18.4%. All code, annotation tools, and the benchmark are publicly released.
📝 Abstract
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.