VC4VG: Optimizing Video Captions for Text-to-Video Generation

📅 2025-10-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current text-to-video (T2V) training predominantly relies on generic video captions, lacking semantic adaptation tailored to generative tasks—leading to poor temporal coherence and weak instruction alignment in generated videos. To address this, we propose VC4VG, the first caption optimization framework explicitly designed for T2V generation. VC4VG introduces a fine-grained, multi-dimensional semantic decomposition schema—covering subjects, actions, scenes, and temporal dynamics—grounded in the intrinsic semantic requirements of T2V models. We further formulate a necessity-based caption refinement principle and establish VC4VG-Bench, a dedicated evaluation benchmark. Closed-loop fine-tuning experiments across multiple T2V models demonstrate that optimized captions significantly improve generation quality: action coherence increases by 23.6%, and instruction-following accuracy rises by 18.4%. All code, annotation tools, and the benchmark are publicly released.

Technology Category

Application Category

📝 Abstract
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models.We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements.Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.
Problem

Research questions and friction points this paper is trying to address.

Optimizing video captions for text-to-video generation training
Developing comprehensive caption framework for video reconstruction needs
Establishing benchmark with T2V-specific evaluation metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework optimizes video captions for generation models
Benchmark introduces multi-dimensional necessity-graded metrics
Methodology decomposes caption elements for video reconstruction
🔎 Similar Papers
No similar papers found.
Y
Yang Du
School of Information, Renmin University of China
Z
Zhuoran Lin
Taobao & Tmall Group of Alibaba
K
Kaiqiang Song
Taobao & Tmall Group of Alibaba
B
Biao Wang
Taobao & Tmall Group of Alibaba
Z
Zhicheng Zheng
Taobao & Tmall Group of Alibaba
Tiezheng Ge
Tiezheng Ge
Senior staff algorithm engineer, Alimama, Alibaba Group
Computer VisionAIGCRecommender Systems
B
Bo Zheng
Taobao & Tmall Group of Alibaba
Qin Jin
Qin Jin
中国人民大学信息学院
人工智能