🤖 AI Summary
Existing video generation models struggle to align their outputs with complex user intent, and current test-time optimization methods are either computationally expensive or require white-box access to model internals. To address this, the authors propose VQQA, a framework that, for the first time, brings a multi-agent visual question answering mechanism into video generation optimization. Multiple agents dynamically generate visual questions, and semantic feedback from vision-language models is used to construct interpretable optimization signals, enabling efficient closed-loop prompt refinement under black-box conditions. By replacing conventional passive metrics with a natural language interface, VQQA substantially improves generation quality, achieving gains of 11.57% on T2V-CompBench and 8.43% on VBench2, and eliminates visual artifacts within just a few optimization steps, outperforming current stochastic search and prompt-tuning baselines.
📝 Abstract
Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.
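The closed loop described above can be sketched in miniature. This is an illustrative stand-in, not the paper's actual implementation: the function names (`refine_prompt`, `generate`, `ask_vlm`) and the question-to-hint heuristic are assumptions, and the generator and VLM critic are modeled as simple callables so the control flow is visible.

```python
# Hypothetical sketch of VQQA-style closed-loop prompt refinement.
# All names here are illustrative, not the paper's API. The black-box
# video generator and the VLM critic are passed in as callables; each
# step asks visual questions, collects the ones the VLM answers "no"
# to, and appends corrective hints to the prompt.

def refine_prompt(prompt, generate, ask_vlm, questions, max_steps=3):
    """Iteratively patch `prompt` until all visual questions pass."""
    for _ in range(max_steps):
        video = generate(prompt)                      # black-box generation
        failed = [q for q in questions if not ask_vlm(video, q)]
        if not failed:                                # all critiques satisfied
            break
        # Turn each failed question into a natural-language correction hint.
        prompt += " " + " ".join(f"Ensure: {q}" for q in failed)
    return prompt


# Toy stand-ins: the "generator" echoes the prompt as the "video", and
# the "VLM" checks whether a question's key noun appears in that string.
generate = lambda p: p
ask_vlm = lambda video, q: q.split()[-1].rstrip("?") in video

result = refine_prompt(
    "A red ball bounces.", generate, ask_vlm,
    ["Is the ball red?", "Is there a shadow?"],
)
```

In the toy run, the shadow question fails on the first pass, so the loop appends a corrective hint and converges on the next step, mirroring how VLM critiques act as semantic gradients over a natural language interface rather than scalar metric scores.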