RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address a key limitation of text-to-video (T2V) generation, namely that user prompts are typically short, syntactically unstructured, and misaligned with training data, this paper proposes RAPO++, a cross-stage prompt optimization framework. RAPO++ comprises three synergistic stages: (1) relation-graph-guided retrieval augmentation to improve prompt–training-data alignment; (2) test-time closed-loop iterative optimization leveraging multi-source feedback (semantic fidelity, spatiotemporal consistency, and optical flow); and (3) lightweight fine-tuning of a large language model to enhance prompt structuring capability. Crucially, RAPO++ operates without modifying the underlying T2V generator, ensuring model-agnosticism and high scalability. Evaluated across five state-of-the-art T2V models and five benchmarks, RAPO++ achieves significant improvements in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, consistently outperforming existing prompt optimization methods.

📝 Abstract
Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.
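The retrieval-augmentation idea in Stage 1 can be pictured with a minimal sketch. Everything below (the toy RELATION_GRAPH, the word-matching rule, and the append-style merge) is an illustrative assumption, not the paper's actual graph construction or algorithm:

```python
# Minimal sketch of relation-graph-guided prompt augmentation in the
# spirit of RAPO's Stage 1. The graph and matching rule are hypothetical
# stand-ins for relations mined from training captions.

RELATION_GRAPH = {
    "dog": ["running on grass", "wagging its tail"],
    "beach": ["with waves rolling in", "at sunset"],
}

def augment_prompt(prompt: str, graph: dict, k: int = 1) -> str:
    """Append up to k retrieved modifiers for each entity found in the prompt."""
    words = set(prompt.lower().replace(",", " ").split())
    modifiers = []
    for entity, mods in graph.items():
        if entity in words:
            modifiers.extend(mods[:k])
    if not modifiers:
        return prompt
    return prompt + ", " + ", ".join(modifiers)

print(augment_prompt("a dog plays on the beach", RELATION_GRAPH))
```

A real implementation would retrieve modifiers by traversing a relation graph built from training captions and then refactor the merged prompt with an LLM to match training-caption structure; the string matching here only illustrates the retrieve-and-merge shape.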
Problem

Research questions and friction points this paper is trying to address.

Optimizing short user prompts for better text-to-video generation quality
Aligning prompts with training data to enhance compositionality and fidelity
Iteratively refining prompts using multi-source feedback for improved video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-stage prompt optimization framework for text-to-video generation
Retrieval-augmented prompt optimization with training-data alignment
Test-time iterative scaling using multi-source feedback mechanisms
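The test-time iterative loop named above can be sketched as generate, score with multiple feedback signals, rewrite, and keep the best-scoring prompt. The Feedback fields, the sum aggregation, and the generate/evaluate/rewrite callables are hypothetical placeholders for a T2V model, its evaluators, and a rewriter LLM, not the paper's actual interfaces:

```python
# Hedged sketch of a closed-loop, feedback-driven prompt refinement loop
# in the spirit of SSPO (Stage 2). All names are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Feedback:
    # Multi-source signals; the three names and sum aggregation are
    # simplifications of the paper's semantic/spatial/temporal feedback.
    semantic: float
    spatial: float
    temporal: float

    def total(self) -> float:
        return self.semantic + self.spatial + self.temporal

def refine_prompt(
    prompt: str,
    generate: Callable[[str], object],        # stand-in for a T2V model
    evaluate: Callable[[object, str], Feedback],
    rewrite: Callable[[str, Feedback], str],  # stand-in for a rewriter LLM
    max_iters: int = 3,
) -> str:
    """Closed-loop refinement: return the best-scoring prompt seen so far."""
    best_prompt = prompt
    best_fb = evaluate(generate(prompt), prompt)
    current = prompt
    for _ in range(max_iters):
        current = rewrite(current, best_fb)   # feedback-conditioned rewrite
        fb = evaluate(generate(current), current)
        if fb.total() > best_fb.total():
            best_prompt, best_fb = current, fb
    return best_prompt
```

In the paper's setting the loop would call an actual video generator and multi-source evaluators each iteration; keeping only the best-scoring prompt makes the loop a simple hill climb over prompt space.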
Bingjie Gao
Shanghai Jiao Tong University
computer vision
Qianli Ma
Shanghai Jiao Tong University, Shanghai 200240, China
Xiaoxue Wu
Fudan University
video generation
Shuai Yang
Shanghai Jiao Tong University, Shanghai 200240, China
Guanzhou Lan
Northwestern Polytechnical University
computer vision, embodied AI
Haonan Zhao
Shanghai Jiao Tong University, Shanghai 200240, China
Jiaxuan Chen
Shanghai Jiao Tong University, Shanghai 200240, China
Qingyang Liu
Shanghai Jiao Tong University, Shanghai 200240, China
Yu Qiao
Shanghai Artificial Intelligence Laboratory, Shanghai 201112, China
Xinyuan Chen
Shanghai Artificial Intelligence Laboratory, Shanghai 201112, China
Yaohui Wang
Research Scientist, Shanghai AI Laboratory | Inria
machine learning, deep generative models, video generation
Li Niu
Shanghai Jiao Tong University
computer vision, machine learning, deep learning