T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

📅 2024-10-08

🏛️ arXiv.org

📈 Citations: 31

✨ Influential: 2

🤖 AI Summary

This paper improves a diffusion-based text-to-video (T2V) model in the post-training phase by distilling a consistency model from a pretrained T2V teacher. The method, T2V-Turbo-v2, integrates three supervision signals into the consistency distillation process: training data tailored to specific learning objectives, feedback from multiple reward models, and conditional guidance expressed as an energy function that augments the teacher ODE solver. To demonstrate the guidance design, motion guidance is extracted from the training datasets and incorporated into the ODE solver, which improves the motion-related metrics on VBench and T2V-CompBench. T2V-Turbo-v2 reaches a Total score of 85.13 on VBench, surpassing proprietary systems such as Gen-3 and Kling.

📝 Abstract

In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the post-training phase by distilling a highly capable consistency model from a pretrained T2V model. Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals, including high-quality training data, reward model feedback, and conditional guidance, into the consistency distillation process. Through comprehensive ablation studies, we highlight the crucial importance of tailoring datasets to specific learning objectives and the effectiveness of learning from diverse reward models for enhancing both the visual quality and text-video alignment. Additionally, we highlight the vast design space of conditional guidance strategies, which centers on designing an effective energy function to augment the teacher ODE solver. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver, showcasing its effectiveness in improving the motion quality of the generated videos with the improved motion-related metrics from VBench and T2V-CompBench. Empirically, our T2V-Turbo-v2 establishes a new state-of-the-art result on VBench, with a Total score of 85.13, surpassing proprietary systems such as Gen-3 and Kling.
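The idea of mixing reward model feedback into the distillation objective can be illustrated with a minimal sketch. It assumes, purely for illustration, that the objective is the consistency distillation loss minus a weighted sum of reward scores; the function name `total_loss`, the weights, and the numeric values are hypothetical and are not taken from the paper.

```python
def total_loss(cd_loss, rewards, weights):
    """Combine a distillation loss with feedback from several reward models.

    Assumption: higher reward is better, so each reward is subtracted
    from the loss after scaling by its (hypothetical) weight.
    """
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return cd_loss - sum(w * r for w, r in zip(weights, rewards))


# Hypothetical values: one distillation loss and two reward scores.
loss = total_loss(cd_loss=1.0, rewards=[0.5, 0.2], weights=[1.0, 2.0])
```

In this toy form, raising a weight makes the corresponding reward model pull harder on the optimization, which is one simple way to balance visual quality against text-video alignment.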

Problem

Research questions and friction points this paper is trying to address.

Improving the quality of video generation models through post-training

Improving text-video alignment through reward feedback

Designing conditional guidance for better motion generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Consistency distillation from a pretrained T2V model

Integrating reward feedback and conditional guidance

Designing an energy function to augment the teacher ODE solver
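Augmenting an ODE solver with an energy function can be sketched on a toy problem. This is not the paper's solver: the linear `drift`, the quadratic energy, the `scale` parameter, and all function names are assumptions chosen so that the effect of guidance is easy to see. Each Euler step follows the drift and also takes a small descent step on the energy.

```python
import numpy as np


def drift(x, t):
    # Hypothetical stand-in for the teacher model's ODE drift.
    return -x


def energy_grad(x, target):
    # Gradient of the toy energy E(x) = 0.5 * ||x - target||^2.
    return x - target


def guided_euler_step(x, t, dt, target, scale):
    # Euler step on the drift, augmented by a descent step on the energy.
    return x + dt * (drift(x, t) - scale * energy_grad(x, target))


def solve(x0, target, scale, steps=100, dt=0.05):
    # Integrate the guided ODE from x0 for a fixed number of Euler steps.
    x = np.asarray(x0, dtype=float)
    for i in range(steps):
        x = guided_euler_step(x, i * dt, dt, target, scale)
    return x


unguided = solve([1.0], target=1.0, scale=0.0)  # decays toward 0
guided = solve([1.0], target=1.0, scale=4.0)    # settles near 0.8
```

With `scale=0.0` the trajectory follows the drift alone and decays toward zero. With `scale=4.0` the energy term pulls the solution toward `target`, and the trajectory settles where drift and guidance balance, at `scale * target / (1 + scale)`. In the setting described above, the energy would instead measure how well a sample agrees with a conditioning signal such as motion.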

🔎 Similar Papers

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way