A Systematic Post-Train Framework for Video Generation

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses three challenges in deploying large-scale video diffusion models: prompt sensitivity, temporal inconsistency, and high inference cost. It introduces a four-stage post-training framework comprising supervised fine-tuning (SFT), a Group Relative Policy Optimization (GRPO) method tailored for video diffusion, prompt enhancement driven by a language model, and inference optimization. The GRPO stage, a reinforcement learning step that the authors describe as novel, targets perceptual quality and temporal coherence, while the pipeline as a whole aims to improve instruction following, visual fidelity, and temporal coherence. The reported experiments show that the approach mitigates common generation artifacts and improves controllability and visual aesthetics under strict sampling cost constraints.

📝 Abstract

While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.
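
The abstract does not spell out how GRPO is adapted to video diffusion, so the following is only a minimal sketch of the generic group-relative advantage step that GRPO is known for: several samples are drawn for the same prompt, each is scored by a reward model, and each reward is normalized against the group's mean and standard deviation. The function name, the example reward values, and the use of the population standard deviation are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that score above the group average get a positive advantage and samples
    below it get a negative one. eps guards against a zero std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four videos sampled from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the baseline comes from the group itself, this step needs no separate learned value function. How the resulting advantages are applied to the diffusion sampling trajectory is specific to the paper and is not reproduced here.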

Problem

Research questions and friction points this paper is trying to address.

prompt sensitivity

temporal inconsistency

inference cost

video generation

real-world deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-Training Framework

Video Diffusion Models

Group Relative Policy Optimization