VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

📅 2025-03-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To bridge the semantic gap between users’ short, ambiguous prompts and text-to-video (T2V) model training data, this paper proposes a two-stage prompt optimization framework grounded in three principles: harmlessness, accuracy, and helpfulness. Methodologically, it introduces the first T2V-specific prompt optimization paradigm, integrating joint textual and video-level preference feedback to jointly align prompt refinement with video generation quality. The framework supports end-to-end optimization via supervised fine-tuning, multi-granularity feedback modeling, and construction of safety-aligned preference data. Experiments demonstrate a 38% improvement in harmful content rejection rate, a 42% increase in user intent fidelity, and a 27% reduction in Fréchet Video Distance (FVD). The framework exhibits strong generalizability—compatible with mainstream T2V models—and synergizes effectively with reinforcement learning from human feedback (RLHF).

Technology Category

Application Category

📝 Abstract
Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at https://github.com/thu-coai/VPO.
Problem

Research questions and friction points this paper is trying to address.

Optimizing prompts to bridge gap between user inputs and detailed descriptions for text-to-video generation
Addressing limitations of current methods in preserving user intent and ensuring safety
Enhancing video quality and alignment through a principled two-stage optimization framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage optimization with SFT and feedback
Aligns prompts via harmlessness, accuracy, helpfulness
Combines text-level and video-level feedback mechanisms
🔎 Similar Papers
No similar papers found.