VISTA: A Test-Time Self-Improving Video Generation Agent

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-video generation depends critically on prompt precision, yet existing test-time optimization methods struggle to simultaneously ensure temporal coherence, semantic fidelity, and cross-modal alignment. To address this, we propose the first test-time self-improving multi-agent framework designed specifically for video generation. Our approach integrates structured temporal planning, pairwise competitive filtering, and a critique-and-reasoning feedback loop (three specialized critics plus a reasoning agent) to enable autonomous prompt evolution and iterative refinement of video quality. The framework unifies joint audio-visual-contextual evaluation, feedback-driven prompt rewriting, and multi-agent collaborative decision-making. Evaluated on single- and multi-scene benchmarks, it consistently outperforms state-of-the-art methods, achieving up to a 60% pairwise win rate and a 66.4% human preference rate, while significantly improving the semantic accuracy and temporal continuity of generated videos.

📝 Abstract
Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
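The loop described in the abstract (decompose → generate → tournament-select → critique → rewrite) can be sketched as minimal Python. This is an illustrative toy, not the paper's code: every function name (`plan_scenes`, `generate`, `tournament_select`, `critique`, `rewrite`, `vista_loop`) is a hypothetical stand-in, "videos" are plain strings, and the judge and critics are trivial heuristics in place of real model calls.

```python
def plan_scenes(idea: str) -> list[str]:
    """Decompose a user idea into a structured temporal plan (one prompt per scene)."""
    return [f"Scene {i + 1}: {idea}" for i in range(3)]

def generate(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for a text-to-video model: return n candidate 'videos' (here, strings)."""
    return [f"{prompt} [sample {k}]" for k in range(n)]

def tournament_select(candidates, judge):
    """Pairwise single-elimination tournament; the judge picks each pair's winner."""
    pool = list(candidates)
    while len(pool) > 1:
        winners = [judge(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:  # odd one out gets a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0]

def critique(video: str) -> dict:
    """Trio of critics (visual / audio / contextual) -- toy check by keyword presence."""
    return {axis: (axis in video) for axis in ("visual", "audio", "context")}

def rewrite(prompt: str, feedback: dict) -> str:
    """Reasoning agent: fold the unmet critique axes back into the prompt."""
    missing = [axis for axis, ok in feedback.items() if not ok]
    return prompt + " | emphasize: " + ", ".join(missing) if missing else prompt

def vista_loop(idea: str, iterations: int = 2) -> str:
    """One full self-improvement cycle per iteration, as in the abstract."""
    prompt = " / ".join(plan_scenes(idea))
    judge = lambda a, b: a if len(a) >= len(b) else b  # toy judge: prefer detail
    best = ""
    for _ in range(iterations):
        best = tournament_select(generate(prompt), judge)
        prompt = rewrite(prompt, critique(best))
    return best
```

With real components, `generate` would call a video model, `judge` a pairwise video preference model, and `critique`/`rewrite` LLM agents; the control flow above stays the same.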
Problem

Research questions and friction points this paper is trying to address.

Improving video quality through autonomous prompt refinement
Addressing multifaceted challenges in test-time video optimization
Enhancing video alignment with user intent via iterative generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent system autonomously refines prompts iteratively
Decomposes user ideas into structured temporal plans
Specialized agents critique visual, audio, and contextual fidelity
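One design note on the pairwise competitive filtering above: a single-elimination tournament finds a winner in n-1 judge calls, versus n(n-1)/2 for a full round-robin, which matters when each comparison is an expensive model call. A toy sketch (the `judge` here is a hypothetical stand-in for a video preference model, not anything from the paper):

```python
def knockout_best(candidates, judge):
    """Single-elimination tournament; returns (winner, number_of_judge_calls)."""
    pool, calls = list(candidates), 0
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            winners.append(judge(pool[i], pool[i + 1]))
            calls += 1
        if len(pool) % 2:  # odd one out gets a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0], calls

def round_robin_best(candidates, judge):
    """Every pair compared once; the candidate with the most wins is selected."""
    items = list(candidates)
    wins = {c: 0 for c in items}
    calls = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            wins[judge(items[i], items[j])] += 1
            calls += 1
    return max(items, key=wins.get), calls
```

For 4 candidates the knockout uses 3 comparisons against the round-robin's 6; the gap widens linearly vs. quadratically as the candidate pool grows.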