Video-Bench: Human-Aligned Video Generation Benchmark

📅 2025-04-07
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Existing video generation evaluation benchmarks suffer from two key limitations: conventional multi-metric embedding approaches correlate poorly with human preferences, while LLM-based methods, though capable of reasoning, lack a deep understanding of video quality and cross-modal consistency. To address this, we introduce the first comprehensive, human-preference-oriented video generation benchmark, and the first to integrate multimodal large language models (MLLMs) across every evaluation dimension. We propose a novel few-shot scoring paradigm coupled with a chain-of-query mechanism to enable scalable, structured, and holistic automated assessment across multiple dimensions, including temporal coherence, visual fidelity, and text-video alignment. Empirical validation on state-of-the-art models (e.g., Sora) demonstrates significantly better human alignment than existing benchmarks across all dimensions, greater objectivity in ambiguous cases, and, in certain scenarios, judgments that surpass those of human raters.
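To make the few-shot scoring paradigm concrete, here is a minimal Python sketch assuming a generic text-prompted MLLM. The helper names (Exemplar, build_few_shot_prompt, query_mllm), the 1-5 scale, and the prompt wording are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of few-shot scoring: the MLLM sees a handful of human-scored
# exemplar videos before rating a new one, anchoring its scale to human
# preferences. All names and the prompt format are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Exemplar:
    video_description: str  # stand-in for the video frames/tokens fed to the MLLM
    human_score: int        # reference score assigned by human annotators (1-5)

def build_few_shot_prompt(exemplars: list[Exemplar], dimension: str, candidate: str) -> str:
    """Anchor the MLLM's scale with human-rated exemplars before the new video."""
    lines = [f"Rate the {dimension} of a generated video on a 1-5 scale.", ""]
    for i, ex in enumerate(exemplars, 1):
        lines += [f"Example {i}: {ex.video_description}", f"Score: {ex.human_score}", ""]
    lines += [f"Now rate this video: {candidate}", "Score:"]
    return "\n".join(lines)

def query_mllm(prompt: str) -> str:
    """Placeholder for a real multimodal LLM call; wire to your model of choice."""
    raise NotImplementedError

if __name__ == "__main__":
    shots = [
        Exemplar("A cat walking; limbs flicker between frames.", 2),
        Exemplar("A car driving smoothly along a coastal road.", 5),
    ]
    print(build_few_shot_prompt(shots, "temporal coherence",
                                "A dancer spinning under stage lights."))
```

The point of the design is that the exemplars carry human reference scores, which is what ties the MLLM's output scale to human preference rather than to an arbitrary internal rubric.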

📝 Abstract
Video generation assessment is essential for ensuring that generative models produce visually realistic, high-quality videos while aligning with human expectations. Current video generation benchmarks fall into two main categories: traditional benchmarks, which use metrics and embeddings to evaluate generated video quality across multiple dimensions but often lack alignment with human judgments; and large language model (LLM)-based benchmarks, which, though capable of human-like reasoning, are constrained by a limited understanding of video quality metrics and cross-modal consistency. To address these challenges and establish a benchmark that better aligns with human preferences, this paper introduces Video-Bench, a comprehensive benchmark featuring a rich prompt suite and extensive evaluation dimensions. This benchmark represents the first attempt to systematically leverage multimodal large language models (MLLMs) across all dimensions relevant to video generation assessment in generative models. By incorporating few-shot scoring and chain-of-query techniques, Video-Bench provides a structured, scalable approach to evaluating generated video. Experiments on advanced models, including Sora, demonstrate that Video-Bench achieves superior alignment with human preferences across all dimensions. Moreover, in instances where our framework's assessments diverge from human evaluations, it consistently offers more objective and accurate insights, suggesting a potential advantage even over traditional human judgment.
Problem

Research questions and friction points this paper is trying to address.

Assessing video generation alignment with human expectations
Addressing limitations in current video quality benchmarks
Leveraging MLLMs for systematic video generation evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages MLLMs for video generation assessment
Uses few-shot scoring (sketched above) and chain-of-query techniques (see the sketch after this list)
Aligns benchmark with human preferences effectively
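The chain-of-query mechanism can be sketched in the same spirit: rather than asking for a single monolithic rating, the evaluator poses narrower sub-questions (e.g., subject presence, motion smoothness) and folds earlier answers into later queries before the final score. The helper below, its stub model, and the example questions are hypothetical illustrations of that reading, not the paper's implementation.

```python
# Hedged sketch of a chain-of-query loop: each question is asked in turn, with
# prior Q/A pairs carried forward as context for the next query. The helper
# name, stub model, and questions are assumptions for illustration only.
from typing import Callable

def chain_of_query(video_ref: str,
                   questions: list[str],
                   ask_mllm: Callable[[str], str]) -> list[tuple[str, str]]:
    """Run each question against the MLLM, carrying prior Q/A pairs as context."""
    transcript: list[tuple[str, str]] = []
    for q in questions:
        context = "\n".join(f"Q: {pq}\nA: {pa}" for pq, pa in transcript)
        prompt = f"Video: {video_ref}\n{context}\nQ: {q}\nA:"
        transcript.append((q, ask_mllm(prompt)))
    return transcript

if __name__ == "__main__":
    dummy = lambda prompt: "yes (stub answer)"  # replace with a real MLLM call
    for q, a in chain_of_query("clip_0421.mp4", [
        "Does the video contain the subject named in the prompt?",
        "Is the subject's motion smooth across consecutive frames?",
        "Given the answers above, rate text-video alignment from 1 to 5.",
    ], dummy):
        print(f"{q} -> {a}")
```

Decomposing the judgment this way is what makes the assessment structured and auditable: the final score is conditioned on explicit intermediate answers rather than produced in one opaque step.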
👥 Authors

Hui Han
Shanghai Jiao Tong University

Siyuan Li
Shanghai Jiao Tong University

Jiaqi Chen
Stanford University, Fellou AI, Fudan University

Yiwen Yuan
Carnegie Mellon University

Yuling Wu
Hong Kong Polytechnic University

Chak Tou Leong
Carnegie Mellon University

Hanwen Du
The Ohio State University
Machine Learning

Junchen Fu
University of Glasgow
Multimodality · LLM · Video Generation · Recommender Systems

Youhua Li
City University of Hong Kong
LLM · Information Systems · Data Mining

Jie Zhang
Fudan University

Chi Zhang
Westlake University

Li-jia Li
LiveX AI

Yongxin Ni
National University of Singapore
Recommender Systems