🤖 AI Summary
Multimodal large language models (MLLMs) face two key challenges in image aesthetic assessment: the scarcity of aesthetic reasoning data and the difficulty of modeling subjective human preferences. Method: We propose Aes-R1, a reinforcement-learning framework that uses the AesCoT pipeline to generate high-quality chain-of-thought (CoT) aesthetic reasoning data for cold-start training, then applies Relative-Absolute Policy Optimization (RAPO), a novel RL algorithm that jointly optimizes absolute score regression and relative preference ranking, unifying these complementary objectives for fine-grained aesthetic modeling and cross-scenario generalization. Contribution/Results: Experiments demonstrate that Aes-R1 improves the backbone's average PLCC and SRCC by 47.9% and 34.8% across mainstream benchmarks, significantly outperforming state-of-the-art methods of similar scale, and it remains robust under low-resource and out-of-distribution settings.
📝 Abstract
Multimodal large language models (MLLMs) are well suited to image aesthetic assessment, as they can capture high-level aesthetic features by leveraging their cross-modal understanding capacity. However, the scarcity of multimodal aesthetic reasoning data and the inherently subjective nature of aesthetic judgment make it difficult for MLLMs to generate accurate aesthetic judgments with interpretable rationales. To this end, we propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL). Concretely, Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data used for cold-start training. After teaching the model to generate structured explanations prior to scoring, we then employ Relative-Absolute Policy Optimization (RAPO), a novel RL algorithm that jointly optimizes absolute score regression and relative ranking order, improving both per-image accuracy and cross-image preference judgments. Aes-R1 enables MLLMs to generate grounded explanations alongside faithful scores, thereby enhancing aesthetic scoring and reasoning in a unified framework. Extensive experiments demonstrate that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8%, surpassing state-of-the-art baselines of similar size. Further ablation studies validate Aes-R1's robust generalization under limited supervision and in out-of-distribution scenarios.
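To make the "absolute plus relative" idea concrete, here is a minimal sketch of what jointly optimizing score regression and ranking order could look like. The function names, the pairwise hinge formulation, and the weighting `lam` are illustrative assumptions for exposition, not the paper's actual RAPO objective or reward design.

```python
# Illustrative sketch of a joint absolute + relative objective.
# All names (absolute_loss, relative_loss, rapo_objective, lam, margin)
# are hypothetical; the paper's actual RAPO formulation may differ.

def absolute_loss(pred, target):
    """Absolute term: mean squared error against ground-truth scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def relative_loss(pred, target, margin=0.1):
    """Relative term: pairwise hinge loss penalizing pairs whose
    predicted order contradicts the ground-truth order."""
    loss, pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:  # image i should outrank image j
                loss += max(0.0, margin - (pred[i] - pred[j]))
                pairs += 1
    return loss / max(pairs, 1)

def rapo_objective(pred, target, lam=0.5):
    """Weighted sum of the absolute and relative terms."""
    return absolute_loss(pred, target) + lam * relative_loss(pred, target)

# Accurate, correctly ordered predictions incur zero loss; swapping a
# pair raises the relative term even when the regression error is small.
good = rapo_objective([0.2, 0.5, 0.9], [0.2, 0.5, 0.9])
swapped = rapo_objective([0.5, 0.2, 0.9], [0.2, 0.5, 0.9])
```

The design intent mirrored here is that the absolute term anchors per-image accuracy (driving PLCC) while the relative term enforces cross-image preference consistency (driving SRCC), so neither objective dominates.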