VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

📅 2026-04-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing video generation benchmarks, which predominantly emphasize technical fidelity while neglecting perceptual and artistic aesthetic quality. To bridge this gap, we propose VGA-Bench, a unified benchmark built on a fine-grained, three-tier taxonomy that jointly assesses aesthetic quality, aesthetic tagging, and generation quality. Using 1,016 prompts and 12 video generation models, we construct a large-scale dataset of over 60,000 videos, annotate a subset with human labels, and train three multi-task neural assessors: VAQA-Net, VTag-Net, and VGQA-Net. Experiments show that the assessors' predictions align reliably with human judgments. The benchmark is publicly released to support applications such as content moderation, model debugging, and generative model optimization.

📝 Abstract

The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment, particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.
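The abstract does not say how alignment with human judgments is measured. One common choice for this kind of comparison is a rank correlation between predicted and human scores. The following is a minimal sketch under that assumption, using Spearman's coefficient in plain Python; the function names and the example scores are hypothetical and do not come from VGA-Bench.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    # Undefined (division by zero) if either sequence is constant.
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, human):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(human) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(predicted), average_ranks(human))


# Hypothetical data: mean human ratings and assessor outputs for five videos.
human_scores = [4.5, 3.0, 2.0, 4.0, 1.5]
model_scores = [0.90, 0.55, 0.40, 0.70, 0.20]
rho = spearman(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f}")
```

Because only the ordering matters, the assessor's output scale does not need to match the human rating scale. A value near 1 means the assessor ranks videos much as the human raters do.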

Problem

Research questions and friction points this paper is trying to address.

video aesthetics

generation quality

evaluation benchmark

AIGC

perceptual quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

video aesthetics

generation quality evaluation

unified benchmark