🤖 AI Summary
This work addresses the lack of systematic evaluation of large multimodal models in video aesthetic perception, a fundamental human quality-assessment ability that remains underexplored for these models. To bridge this gap, we introduce VideoAesBench—the first comprehensive benchmark dedicated to video aesthetics—comprising 1,804 videos drawn from diverse sources including user-generated content (UGC), AI-generated content (AIGC), compressed videos, robotics, and gaming. The benchmark features a structured annotation framework spanning three dimensions—visual form, style, and emotion—and incorporates diverse question formats, including open-ended aesthetic descriptions. A systematic evaluation of 23 leading large models reveals that current approaches possess only rudimentary aesthetic perception capabilities, exhibiting incomplete coverage and limited accuracy, and thereby offers a critical reference for future research on interpretable video aesthetics.
📝 Abstract
Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which has in turn made the evaluation of LMMs increasingly important. However, video aesthetic quality assessment, a fundamental human ability, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs' understanding of video aesthetic quality. VideoAesBench has several notable characteristics: (1) Diverse content: 1,804 videos from multiple sources, including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos. (2) Multiple question formats: traditional single-choice questions, multiple-choice questions, true-or-false questions, and a novel open-ended format for video aesthetics description. (3) Holistic video aesthetics dimensions: visual form questions covering 5 aspects, visual style questions covering 4 aspects, and visual affectiveness questions covering 3 aspects. Based on VideoAesBench, we benchmark 23 open-source and commercial large multimodal models. Our findings show that current LMMs possess only basic video aesthetics perception ability, and their performance remains incomplete and imprecise. We hope VideoAesBench can serve as a strong testbed and offer insights for explainable video aesthetics assessment. The data will be released at https://github.com/michaelliyunhao/VideoAesBench