Optical-Flow Guided Prompt Optimization for Coherent Video Generation

📅 2024-11-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-video diffusion models suffer from severe temporal incoherence, particularly manifesting as motion distortion in long-sequence generation. To address this, we propose MotionPrompt—a novel framework that introduces optical flow modeling into prompt optimization for the first time. During reverse diffusion sampling, MotionPrompt represents motion priors via learnable token embeddings and performs adversarial optimization guided by gradient signals from an optical flow discriminator, enabling general-purpose temporal consistency enhancement without requiring additional video fine-tuning. The method is model-agnostic and compatible with mainstream text-to-video diffusion architectures. Experiments demonstrate consistent improvements across multiple benchmarks: a 18.3% reduction in Fréchet Video Distance (FVD) and a 32.7% increase in motion consistency score, while preserving content fidelity. Our core contribution lies in establishing an optical-flow-guided prompt embedding optimization paradigm, offering a new perspective on temporal modeling for diffusion-based video generation.

Technology Category

Application Category

📝 Abstract
While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish optical flow between random pairs of frames from real videos and generated ones. Given that prompts can influence the entire video, we optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models.
Problem

Research questions and friction points this paper is trying to address.

Enhancing temporal consistency in video generation
Optimizing prompts via optical flow guidance
Improving motion dynamics without content fidelity loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optical flow-guided video generation framework
Discriminator-trained token embedding optimization
Enhanced temporal consistency via motion dynamics
🔎 Similar Papers
No similar papers found.