The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation

📅 2025-04-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Text-to-video (T2V) generation models are highly sensitive to user prompts, yet existing methods lack systematic optimization mechanisms that address prompt distribution shifts and syntactic structural variation. To address this, we propose RAPO, a retrieval-augmented dual-path prompt optimization framework. The first path constructs a vision-semantics relational graph to guide modifier injection and fine-tunes a large language model (LLM) for semantic refinement; the second path leverages an instruction-driven pre-trained LLM to rewrite prompts. This is the first application of relational-graph modeling to T2V prompt optimization, and RAPO integrates cross-modal retrieval, multi-granularity prompt evaluation, and dual-path synergy. Evaluated across multiple state-of-the-art T2V models, RAPO achieves a 32.7% reduction in Fréchet Video Distance (FVD), significantly improves dynamic coherence and static fidelity, and increases the human-evaluation pass rate by 41%.

📝 Abstract
The evolution of text-to-video (T2V) generative models, trained on large-scale datasets, has been marked by significant progress. However, the sensitivity of T2V generative models to input prompts highlights the critical role of prompt design in influencing generative outcomes. Prior research has predominantly relied on large language models (LLMs) to align user-provided prompts with the distribution of training prompts, albeit without tailored guidance encompassing prompt vocabulary and sentence-structure nuances. To this end, we introduce **RAPO**, a novel **R**etrieval-**A**ugmented **P**rompt **O**ptimization framework. To address potential inaccuracies and ambiguous details in LLM-generated prompts, RAPO refines naive prompts through dual optimization branches and selects the superior prompt for T2V generation. The first branch augments user prompts with diverse modifiers extracted from a learned relational graph, refining them to align with the format of training prompts via a fine-tuned LLM. The second branch rewrites the naive prompt using a pre-trained LLM following a well-defined instruction set. Extensive experiments demonstrate that RAPO effectively enhances both the static and dynamic dimensions of generated videos, underscoring the significance of prompt optimization for user-provided prompts. Project website: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
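The dual-branch optimize-and-select flow described in the abstract can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the relational graph is modeled as a plain dictionary, both LLM branches are replaced with simple string transformations, and the scorer is a word-count proxy standing in for RAPO's actual prompt evaluation. All function names here are illustrative assumptions.

```python
# Toy sketch of a dual-branch prompt optimizer in the spirit of RAPO.
# Real branches would be a fine-tuned LLM (branch 1) and an
# instruction-driven pre-trained LLM (branch 2); both are mocked here.

def augment_with_modifiers(prompt: str, relation_graph: dict) -> str:
    """Branch 1 (sketch): inject modifiers retrieved from a relational graph,
    keyed here on the prompt's first word for simplicity."""
    key = prompt.split()[0].lower()
    modifiers = relation_graph.get(key, [])
    return prompt + ", " + ", ".join(modifiers) if modifiers else prompt

def rewrite_with_instructions(prompt: str) -> str:
    """Branch 2 (sketch): stand-in for an instruction-based LLM rewrite."""
    return f"A detailed video of {prompt}, smooth motion, high fidelity"

def score(prompt: str) -> int:
    """Stand-in scorer: RAPO selects the superior branch output; here we
    use a crude word-count proxy in place of real prompt evaluation."""
    return len(prompt.split())

def optimize_prompt(prompt: str, relation_graph: dict) -> str:
    """Run both branches and keep the higher-scoring candidate."""
    candidates = [
        augment_with_modifiers(prompt, relation_graph),
        rewrite_with_instructions(prompt),
    ]
    return max(candidates, key=score)

graph = {"cat": ["fluffy", "playing with yarn"]}
print(optimize_prompt("cat on a sofa", graph))
```

In the actual framework the selection step compares refined prompts along multiple granularities rather than by length; the point of the sketch is only the structure: two independent optimization paths feeding a single selector.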
Problem

Research questions and friction points this paper is trying to address.

Optimizing prompts for better text-to-video generation outcomes
Addressing inaccuracies in LLM-generated prompts for T2V models
Enhancing static and dynamic quality of generated videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented prompt optimization framework
Dual optimization branches for prompt refinement
Enhances static and dynamic video dimensions