VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models

📅 2026-01-29
🤖 AI Summary
This work addresses the lack of systematic evaluation of large multimodal models (LMMs) in video aesthetic perception, a fundamental human ability that remains largely unexplored for these models. To bridge this gap, we introduce VideoAesBench, the first comprehensive benchmark dedicated to video aesthetics, comprising 1,804 videos sourced from diverse domains including user-generated content (UGC), AI-generated content (AIGC), compressed video, robotics, and gaming. The benchmark features a structured annotation framework spanning three dimensions (visual form, style, and emotion) and incorporates multiple question formats, including single-choice, multiple-choice, true-or-false, and open-ended description questions. Systematic evaluation of 23 leading large multimodal models reveals that current approaches possess only rudimentary aesthetic perception capabilities, with incomplete coverage and limited accuracy, thereby offering a critical reference for future research on interpretable video aesthetics.

📝 Abstract
Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which in turn makes the evaluation of LMMs important. However, video aesthetic quality assessment, a fundamental human ability, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs' understanding of video aesthetic quality. VideoAesBench has several significant characteristics: (1) Diverse content: 1,804 videos drawn from multiple sources, including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos. (2) Multiple question formats: traditional single-choice questions, multiple-choice questions, true-or-false questions, and novel open-ended questions for video aesthetics description. (3) Holistic video aesthetics dimensions: visual form questions covering 5 aspects, visual style questions covering 4 aspects, and visual affectiveness questions covering 3 aspects. Based on VideoAesBench, we benchmark 23 open-source and commercial large multimodal models. Our findings show that current LMMs possess only basic video aesthetics perception ability; their performance remains incomplete and imprecise. We hope VideoAesBench can serve as a strong testbed and offer insights for explainable video aesthetics assessment. The data will be released at https://github.com/michaelliyunhao/VideoAesBench
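The objective question formats listed above (single-choice, multiple-choice, true-or-false) can be scored automatically, while open-ended descriptions require separate judgment. A minimal sketch of such scoring logic; the item schema and field names here are hypothetical, not the benchmark's released format:

```python
# Hypothetical scoring sketch for VideoAesBench-style items.
# The "format"/"answer" fields are assumed, not the released schema.

def score_item(item, prediction):
    """Return 1.0 for a correct objective answer, 0.0 for incorrect.

    Open-ended description questions are not string-matchable, so we
    return None for them; the paper evaluates them separately.
    """
    fmt = item["format"]
    if fmt in ("single_choice", "true_false"):
        return 1.0 if prediction == item["answer"] else 0.0
    if fmt == "multi_choice":
        # Exact-set match: all correct options selected, no extras.
        return 1.0 if set(prediction) == set(item["answer"]) else 0.0
    return None  # open-ended: excluded from accuracy

items = [
    {"format": "single_choice", "answer": "B"},
    {"format": "true_false", "answer": True},
    {"format": "multi_choice", "answer": ["A", "C"]},
]
preds = ["B", False, ["C", "A"]]

scores = [score_item(i, p) for i, p in zip(items, preds)]
accuracy = sum(scores) / len(scores)
print(round(accuracy, 3))  # 2 of 3 correct -> 0.667
```

Stricter multiple-choice scoring (exact-set match rather than partial credit) is one common convention; a benchmark could equally award per-option credit.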
Problem

Research questions and friction points this paper is trying to address.

video aesthetics
large multimodal models
aesthetic quality assessment
benchmark
multimodal perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Aesthetics
Large Multimodal Models
Benchmark
Aesthetic Perception
Multimodal Evaluation
Yunhao Li
Shanghai Jiao Tong University
Sijing Wu
Shanghai Jiao Tong University
Zhilin Gao
Shanghai Jiao Tong University
Zicheng Zhang
Shanghai AI Lab
Multi-modal LLM, Quality assessment
Qi Jia
Shanghai AI Laboratory
Huiyu Duan
Shanghai Jiao Tong University
Multimedia Signal Processing
Xiongkuo Min
Shanghai Jiao Tong University
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays