Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

📅 2026-05-12
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses a limitation of existing visual aesthetic assessment methods: most assign an absolute score to a single image, and such scores capture human relative preferences poorly. The authors instead model aesthetic judgment as comparative selection among candidate images with matched subject matter, with labels grounded in expert consensus. They construct VAB, a visual aesthetics benchmark comprising 400 tasks and 1,195 images. Experiments show that the strongest current multimodal model correctly identifies both the best and the worst image, across three random permutations of the candidate order, in only 26.5% of tasks, well below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable.
📝 Abstract
Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.
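The headline metric in the abstract can be read as a strict per-task criterion: a task counts as solved only if the model names both the best and the worst image correctly under every one of three random orderings of the candidates. The sketch below illustrates that reading. It is not the authors' evaluation code; the task dictionary layout, the `judge` callable, and the way permutations are drawn are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical judge interface: receives candidate image IDs in presentation
# order and returns (best_id, worst_id). A real judge would call a model.
Judge = Callable[[Sequence[str]], Tuple[str, str]]


def task_is_solved(
    candidates: Sequence[str],
    gold_best: str,
    gold_worst: str,
    judge: Judge,
    rng: random.Random,
    n_permutations: int = 3,
) -> bool:
    """Return True only if the judge is right on every shuffled ordering."""
    for _ in range(n_permutations):
        order = list(candidates)
        rng.shuffle(order)  # assumed: orderings are drawn uniformly at random
        best, worst = judge(order)
        if best != gold_best or worst != gold_worst:
            return False
    return True


def strict_accuracy(
    tasks: List[Dict],
    judge: Judge,
    n_permutations: int = 3,
    seed: int = 0,
) -> float:
    """Fraction of tasks solved under the strict all-permutations criterion.

    Each task is assumed to be a dict with keys "candidates", "best", "worst".
    """
    if not tasks:
        raise ValueError("tasks must be non-empty")
    rng = random.Random(seed)
    solved = sum(
        task_is_solved(t["candidates"], t["best"], t["worst"], judge, rng, n_permutations)
        for t in tasks
    )
    return solved / len(tasks)
```

Requiring agreement across all orderings penalizes judges whose choice depends on presentation position, which is one plausible motivation for scoring over several permutations rather than a single one.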
Problem

Research questions and friction points this paper is trying to address.

aesthetic judgment
multimodal large language models
visual understanding
comparative preference
expert annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

comparative aesthetic judgment
Visual Aesthetic Benchmark
multimodal large language models
expert-grounded evaluation
aesthetic preference ranking
Authors
Yichen Feng
University of California, Santa Barbara
Financial Mathematics, Stochastic Differential Games, Mean Field Games, Systematic Risk, Portfolio Allocation
Yuetai Li
University of Washington
LLM Agent, LLM Reasoning, Post-training, Trustworthy AI
Chunjiang Liu
Bake AI
Yuanyuan Chen
Bake AI
Fengqing Jiang
University of Washington
Large Language Model, Post-training, Safety and Security, Reasoning, Reinforcement Learning
Yue Huang
PhD student, University of Notre Dame
trustworthy AI, generative model, machine learning, AI for science
Hang Hua
University of Rochester
Computer Vision, Natural Language Processing, Machine Learning
Zhengqing Yuan
PhD student, University of Notre Dame
NLP, Deep Learning, CV
Kaiyuan Zheng
University of Washington
Luyao Niu
University of Washington
CPS security, trustworthy machine learning, game theory and optimization
Bhaskar Ramasubramanian
Western Washington University
reinforcement learning, ML security, CPS security, formal methods, control theory
Basel Alomair
King Abdulaziz City for Science and Technology & University of Washington
Information Security and Cryptography
Xiangliang Zhang
Leonard C. Bettex Collegiate Professor, Computer Science and Engineering, University of Notre Dame
Machine Learning, AI for Science
Misha Sra
UCSB
Spatial Human-AI Interaction, XR, Haptics
Zichen Chen
UC Santa Barbara
Agentic LLM, Trustworthy AI, AI Safety, Synthetic Data
Radha Poovendran
Professor of ECE, University of Washington
Security, Games, Learning, Networks, CPS
Zhangchen Xu
University of Washington
(^._.^)ﾉ, Synthetic Data, Post-Training, Safety, Federated Learning