🤖 AI Summary
Existing video generation evaluation methods suffer from three critical limitations: overly simplistic prompts; evaluation operators that generalize poorly, particularly on out-of-distribution (OOD) samples; and substantial misalignment between automated metrics and human preferences. To address these issues, we propose the first intelligent agent-based evaluation system tailored to state-of-the-art video generation models. Our method introduces a dynamic assessment architecture that integrates large language models (LLMs) with multimodal LLMs (MLLMs), designs a temporally aware, scalable patching tool, and establishes the first benchmark comprising 700 structured prompts and over 12,000 high-quality videos. The system supports both text-to-video (T2V) and image-to-video (I2V) evaluation. Empirical results demonstrate strong agreement with human judgments (Spearman ρ > 0.89), robust performance across 20+ models, and statistically significant gains over conventional metrics such as FID and CLIP-Score on eight SOTA models.
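For intuition on how such agreement is scored, here is a minimal sketch (not the paper's code) of computing the Spearman rank correlation between automated and human scores with `scipy.stats.spearmanr`; all score values below are made-up placeholders.

```python
# Minimal sketch (not the paper's code): metric/human agreement measured
# with Spearman rank correlation, the statistic behind the rho > 0.89 claim.
from scipy.stats import spearmanr

# Hypothetical per-video scores for one model on one evaluation dimension.
agent_scores = [0.82, 0.55, 0.91, 0.34, 0.70, 0.63]  # agent, 0-1 scale
human_scores = [4.5, 3.0, 4.8, 2.1, 3.4, 3.9]        # humans, 1-5 scale

# Spearman compares the rankings the two lists induce, so it is
# insensitive to the different score scales used here.
rho, p_value = spearmanr(agent_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.4f})")
```

Because the correlation is computed on ranks rather than raw values, the agent's 0-1 scores and the raters' 1-5 scores need no calibration onto a common scale before comparison.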
📝 Abstract
The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily due to simple prompts that cannot showcase a model's capabilities, fixed evaluation operators that struggle with out-of-distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge this gap, we propose VideoGen-Eval, an agent-based evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporally dense dimensions, achieving dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting-edge models and to verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (covering both T2V and I2V) and over 12,000 videos generated by more than 20 models; among these, 8 cutting-edge models are selected for quantitative evaluation by both the agent and human raters. Extensive experiments validate that our agent-based evaluation system aligns strongly with human preferences and completes evaluations reliably, and they confirm the diversity and richness of the benchmark.
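To make the pipeline concrete, below is a hypothetical sketch of the evaluation loop the abstract describes; none of these names (`evaluate_video`, `structure`, `judge`, `patch`) come from the paper, and real components would wrap an LLM and an MLLM behind the callables.

```python
# Illustrative sketch only; this interface is hypothetical, not the paper's
# published API. It mirrors the three components named in the abstract:
# LLM content structuring, MLLM content judgment, and patch tools for
# temporally dense dimensions (e.g. motion smoothness).
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DimensionScore:
    dimension: str
    score: float    # normalized 0-1 judgment for this dimension
    rationale: str  # judge's textual justification

# Dimensions assumed (for this sketch) to need temporal pre-processing.
TEMPORALLY_DENSE = {"motion smoothness", "temporal consistency"}

def evaluate_video(
    structure: Callable[[str], list[str]],              # LLM: prompt -> dimensions
    judge: Callable[[object, str], tuple[float, str]],  # MLLM: (evidence, dim) -> (score, rationale)
    patch: Callable[[Sequence, str], object],           # patch tool for dense dimensions
    prompt: str,
    frames: Sequence,
) -> list[DimensionScore]:
    results = []
    for dim in structure(prompt):  # 1) structure the prompt into dimensions
        # 2) temporally dense dimensions go through the patch tool first,
        #    since a frame-level judge alone would miss motion artifacts
        evidence = patch(frames, dim) if dim in TEMPORALLY_DENSE else frames
        score, rationale = judge(evidence, dim)  # 3) MLLM judgment per dimension
        results.append(DimensionScore(dim, score, rationale))
    return results

# Toy stubs showing the call shape; real components would wrap model APIs.
scores = evaluate_video(
    structure=lambda p: ["subject fidelity", "motion smoothness"],
    judge=lambda ev, d: (0.8, f"meets the '{d}' requirement"),
    patch=lambda fr, d: fr,
    prompt="a corgi surfing at sunset",
    frames=[],
)
```

Passing the three components as callables reflects the "dynamic, flexible, and expandable" claim: new dimensions or stronger judges can be swapped in without changing the loop itself.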