Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping

📅 2026-02-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models struggle to identify aesthetic issues in images and to provide actionable photography and cropping suggestions, which limits their usefulness for composition optimization. To address this, the work proposes Venus, a framework trained in two stages: first, aesthetic question answering of progressively increasing complexity builds the model’s ability to deliver aesthetic guidance; second, chain-of-thought (CoT) rationales activate its aesthetic cropping capability. The authors also introduce AesGuide, the first large-scale dataset and benchmark for aesthetic guidance, with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Experiments show that Venus substantially improves aesthetic guidance and achieves state-of-the-art performance in aesthetic cropping, enabling interpretable, interactive aesthetic refinement both during capture and after it.

📝 Abstract

The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG) -- an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. Extensive experiments show that Venus substantially improves AG capability and achieves state-of-the-art (SOTA) performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across both stages of photo creation. Code is available at https://github.com/PKU-ICST-MIPL/Venus_CVPR2026.
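To make the two stages concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical record layout (`AnnotatedPhoto`) holding the three annotation types the abstract names (score, analysis, guidance), a hypothetical ordering of questions from simple to complex, and a hypothetical `(x1, y1, x2, y2)` pixel format for crop boxes. The actual AesGuide schema, question templates, and crop representation may differ; see the linked repository for the real code.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedPhoto:
    # Hypothetical record layout; the real AesGuide schema may differ.
    photo_id: str
    score: float    # aesthetic score
    analysis: str   # what works or fails in the photo
    guidance: str   # actionable shooting advice


def progressive_questions(photo: AnnotatedPhoto) -> list[tuple[str, str]]:
    """Build (question, answer) pairs, ordered from simple to complex.

    The question wording and ordering are assumptions for illustration.
    """
    return [
        ("How would you rate this photo aesthetically?", f"{photo.score:.1f}"),
        ("What aesthetic issues does this photo have?", photo.analysis),
        ("How should the photographer reshoot it?", photo.guidance),
    ]


def clamp_crop(box: tuple[int, int, int, int],
               width: int, height: int) -> tuple[int, int, int, int]:
    """Clamp an (x1, y1, x2, y2) pixel box to the image bounds.

    A model that emits a crop box as text can propose coordinates outside
    the image, so a post-processing step like this is a common safeguard.
    """
    x1, y1, x2, y2 = box
    x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
    y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
    return x1, y1, x2, y2
```

For example, `clamp_crop((-10, 20, 5000, 300), 1920, 1080)` trims the proposed box to the frame of a 1920×1080 image.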

Problem

Research questions and friction points this paper is trying to address.

aesthetic guidance

multimodal large language models

computational aesthetics

aesthetic cropping

photo composition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Aesthetic Guidance

Multimodal Large Language Models

Aesthetic Cropping