P-Flow: Prompting Visual Effects Generation

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of precisely customizing complex dynamic visual effects—such as explosions or shattering—in text-to-video generation, which existing models struggle with due to their reliance on manually crafted, highly specific prompts. To overcome this limitation, we propose P-Flow, a training-free inference-time framework that leverages the semantic and temporal understanding capabilities of vision-language models. By iteratively comparing visual discrepancies between a reference video and the generated output, P-Flow automatically refines the input text prompt without modifying the underlying generative model. Our approach enables high-fidelity, diverse customization of dynamic effects in both text-to-video and image-to-video tasks, significantly outperforming current state-of-the-art methods.

📝 Abstract
Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. Yet crafting a single prompt that accurately specifies these effects is hard and time-consuming for humans, as it requires complex temporal reasoning and iterative refinement. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks. Code is available at https://github.com/showlab/P-Flow.
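The abstract describes an iterative loop: generate a video from the current prompt, have a vision-language model compare its effects against a reference video, and rewrite the prompt to close the gap. The following is a minimal sketch of that control flow only; `generate_video`, `describe_effects`, and `refine_prompt` are hypothetical stand-ins (toy stubs, not the paper's actual models) so the loop can run end to end.

```python
def generate_video(prompt: str) -> str:
    # Stand-in for a frozen text-to-video model: the "video" here is just
    # the prompt itself, so its "effects" are whatever the prompt mentions.
    return prompt

def describe_effects(video: str, vocabulary: set[str]) -> set[str]:
    # Stand-in for the VLM's temporal reasoning: report which known
    # effect terms appear in the (toy) video.
    return {term for term in vocabulary if term in video}

def refine_prompt(prompt: str, missing: set[str]) -> str:
    # Stand-in for VLM-guided prompt editing: append terms for effects
    # present in the reference but absent from the generated output.
    if not missing:
        return prompt
    return prompt + ", " + ", ".join(sorted(missing))

def p_flow(initial_prompt: str, reference_video: str,
           vocabulary: set[str], max_iters: int = 5) -> str:
    # Test-time prompt optimization: the generative model is never updated;
    # only the prompt evolves across iterations.
    prompt = initial_prompt
    target = describe_effects(reference_video, vocabulary)
    for _ in range(max_iters):
        video = generate_video(prompt)
        observed = describe_effects(video, vocabulary)
        missing = target - observed      # visual discrepancy vs. reference
        if not missing:                  # desired effects induced: stop
            break
        prompt = refine_prompt(prompt, missing)
    return prompt

vocab = {"shattering", "explosion", "debris"}
final_prompt = p_flow("a vase on a table",
                      "a vase shattering with flying debris", vocab)
```

In this toy run the loop detects that "shattering" and "debris" appear in the reference but not the generated output, appends them to the prompt, and terminates once the discrepancy is empty, mirroring the training-free refinement the paper describes at a much higher semantic level.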
Problem

Research questions and friction points this paper is trying to address.

dynamic visual effects
video generation
text prompting
temporal reasoning
prompt customization
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt optimization
dynamic visual effects
video generation
vision-language models
training-free customization