VISTA: A Test-Time Self-Improving Video Generation Agent

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-video generation depends critically on prompt precision, yet existing test-time optimization methods struggle to simultaneously ensure temporal coherence, semantic fidelity, and cross-modal alignment. To address this, we propose the first test-time self-improving multi-agent framework designed specifically for video generation. Our approach integrates structured temporal planning, pairwise competitive filtering, and a critique-and-reasoning feedback loop (three specialized critics plus a reasoning agent) to enable autonomous prompt evolution and iterative refinement of video quality. The framework unifies joint audio-visual-contextual evaluation, feedback-driven prompt rewriting, and multi-agent collaborative decision-making. Evaluated on single- and multi-scene benchmarks, it consistently outperforms state-of-the-art methods, achieving up to a 60% pairwise win rate and a 66.4% human preference rate, while significantly improving the semantic accuracy and temporal continuity of generated videos.

📝 Abstract
Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
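The loop described in the abstract (decompose → generate → tournament-select → critique → rewrite) can be sketched as minimal Python. This is an illustrative toy, not the paper's code: every function name (`plan_scenes`, `generate`, `tournament_select`, `critique`, `rewrite`, `vista_loop`) is a hypothetical stand-in, "videos" are plain strings, and the judge and critics are trivial heuristics in place of real model calls.

```python
def plan_scenes(idea: str) -> list[str]:
    """Decompose a user idea into a structured temporal plan (one prompt per scene)."""
    return [f"Scene {i + 1}: {idea}" for i in range(3)]

def generate(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for a text-to-video model: return n candidate 'videos' (here, strings)."""
    return [f"{prompt} [sample {k}]" for k in range(n)]

def tournament_select(candidates, judge):
    """Pairwise single-elimination tournament; the judge picks each pair's winner."""
    pool = list(candidates)
    while len(pool) > 1:
        winners = [judge(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:  # odd one out gets a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0]

def critique(video: str) -> dict:
    """Trio of critics (visual / audio / contextual) -- toy check by keyword presence."""
    return {axis: (axis in video) for axis in ("visual", "audio", "context")}

def rewrite(prompt: str, feedback: dict) -> str:
    """Reasoning agent: fold the unmet critique axes back into the prompt."""
    missing = [axis for axis, ok in feedback.items() if not ok]
    return prompt + " | emphasize: " + ", ".join(missing) if missing else prompt

def vista_loop(idea: str, iterations: int = 2) -> str:
    """One full self-improvement cycle per iteration, as in the abstract."""
    prompt = " / ".join(plan_scenes(idea))
    judge = lambda a, b: a if len(a) >= len(b) else b  # toy judge: prefer detail
    best = ""
    for _ in range(iterations):
        best = tournament_select(generate(prompt), judge)
        prompt = rewrite(prompt, critique(best))
    return best
```

With real components, `generate` would call a video model, `judge` a pairwise video preference model, and `critique`/`rewrite` LLM agents; the control flow above stays the same.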
Problem

Research questions and friction points this paper is trying to address.

Improving video quality through autonomous prompt refinement
Addressing multifaceted challenges in test-time video optimization
Enhancing video alignment with user intent via iterative generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent system autonomously refines prompts iteratively
Decomposes user ideas into structured temporal plans
Specialized agents critique visual, audio, and contextual fidelity
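One design note on the pairwise competitive filtering above: a single-elimination tournament finds a winner in n-1 judge calls, versus n(n-1)/2 for a full round-robin, which matters when each comparison is an expensive model call. A toy sketch (the `judge` here is a hypothetical stand-in for a video preference model, not anything from the paper):

```python
def knockout_best(candidates, judge):
    """Single-elimination tournament; returns (winner, number_of_judge_calls)."""
    pool, calls = list(candidates), 0
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            winners.append(judge(pool[i], pool[i + 1]))
            calls += 1
        if len(pool) % 2:  # odd one out gets a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0], calls

def round_robin_best(candidates, judge):
    """Every pair compared once; the candidate with the most wins is selected."""
    items = list(candidates)
    wins = {c: 0 for c in items}
    calls = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            wins[judge(items[i], items[j])] += 1
            calls += 1
    return max(items, key=wins.get), calls
```

For 4 candidates the knockout uses 3 comparisons against the round-robin's 6; the gap widens linearly vs. quadratically as the candidate pool grows.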