Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) face two key challenges in image aesthetic assessment: the scarcity of aesthetic reasoning data and the difficulty of modeling subjective human preferences. Method: The authors propose Aes-R1, an aesthetic reasoning framework built on two components. First, the AesCoT pipeline constructs and filters high-quality chain-of-thought (CoT) aesthetic reasoning data for cold-start training. Second, Relative-Absolute Policy Optimization (RAPO), a novel reinforcement-learning algorithm, jointly optimizes absolute score regression and relative preference ranking, unifying these complementary objectives for fine-grained aesthetic modeling. Contribution/Results: Experiments show that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8% on mainstream benchmarks, surpassing same-scale state-of-the-art methods, and it remains robust under low-resource and out-of-distribution settings.

📝 Abstract
Multimodal large language models (MLLMs) are well suited to image aesthetic assessment, as they can capture high-level aesthetic features by leveraging their cross-modal understanding capacity. However, the scarcity of multimodal aesthetic reasoning data and the inherently subjective nature of aesthetic judgment make it difficult for MLLMs to generate accurate aesthetic judgments with interpretable rationales. To this end, we propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL). Concretely, Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data used for cold-start. After teaching the model to generate structured explanations prior to scoring, we then employ Relative-Absolute Policy Optimization (RAPO), a novel RL algorithm that jointly optimizes absolute score regression and relative ranking order, improving both per-image accuracy and cross-image preference judgments. Aes-R1 enables MLLMs to generate grounded explanations alongside faithful scores, thereby enhancing aesthetic scoring and reasoning in a unified framework. Extensive experiments demonstrate that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8%, surpassing state-of-the-art baselines of similar size. Further ablation studies validate Aes-R1's robust generalization under limited supervision and in out-of-distribution scenarios.
Problem

Research questions and friction points this paper is trying to address.

Addresses subjective aesthetic judgment in multimodal language models
Overcomes scarcity of multimodal aesthetic reasoning training data
Improves accuracy of aesthetic scoring and preference ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

AesCoT pipeline constructs filtered chain-of-thought data
Relative-Absolute Policy Optimization jointly optimizes scores and rankings
Generates structured explanations before scoring for unified reasoning
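The exact form of RAPO's joint objective is not given on this page. As an illustration only, here is a minimal sketch of how an absolute score-regression term and a pairwise relative-ranking term might be combined into one loss; the function name, `alpha` weighting, and `margin` parameter are hypothetical, not taken from the paper:

```python
import numpy as np

def relative_absolute_loss(pred, true, alpha=0.5, margin=0.0):
    """Illustrative combined objective in the spirit of joint
    absolute/relative optimization (names and weighting are assumptions).

    pred, true: per-image aesthetic scores for a batch.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)

    # Absolute term: per-image score regression (mean squared error).
    abs_loss = np.mean((pred - true) ** 2)

    # Relative term: for every image pair whose ground-truth scores differ,
    # penalize predicted orderings that disagree (margin ranking loss).
    diff_pred = pred[:, None] - pred[None, :]
    diff_true = true[:, None] - true[None, :]
    sign = np.sign(diff_true)          # +1 / -1 for each ordered pair
    mask = sign != 0                   # skip ties in the ground truth
    rel_loss = np.maximum(margin - sign * diff_pred, 0.0)[mask].mean()

    return alpha * abs_loss + (1 - alpha) * rel_loss
```

Under this sketch, perfectly calibrated predictions drive both terms to zero, while predictions that are close in value but wrongly ordered are still penalized by the relative term, which is the intuition behind optimizing scores and rankings jointly.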
👥 Authors
Boyang Liu (Fudan University)
Yifan Hu (Tsinghua University)
Senjie Jin (Fudan University)
Shihan Dou (Fudan University)
Gonglei Shi (ByteDance)
Jie Shao (University of Electronic Science and Technology of China)
Tao Gui (Fudan University)
Xuanjing Huang (Fudan University)