🤖 AI Summary
Multimodal large language models (MLLMs) face two key challenges in image aesthetic assessment: the scarcity of aesthetic reasoning data and the difficulty of modeling subjective human preferences. Method: We propose Aes-R1, a reinforcement-learning framework that uses the AesCoT pipeline to generate high-quality chain-of-thought (CoT) aesthetic reasoning data for cold-start training, then applies Relative-Absolute Policy Optimization (RAPO), a novel RL algorithm that jointly optimizes absolute score regression and relative preference ranking, unifying these complementary objectives for fine-grained aesthetic modeling and cross-scenario generalization. Contribution/Results: Experiments demonstrate that Aes-R1 improves the backbone's average PLCC and SRCC by 47.9% and 34.8% across mainstream benchmarks, significantly outperforming state-of-the-art methods of similar scale, and it remains robust under low-resource and out-of-distribution settings.
📝 Abstract
Multimodal large language models (MLLMs) are well suited to image aesthetic assessment, as they can capture high-level aesthetic features by leveraging their cross-modal understanding capacity. However, the scarcity of multimodal aesthetic reasoning data and the inherently subjective nature of aesthetic judgment make it difficult for MLLMs to generate accurate aesthetic judgments with interpretable rationales. To this end, we propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL). Concretely, Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data used for cold-start training. After teaching the model to generate structured explanations prior to scoring, we then employ Relative-Absolute Policy Optimization (RAPO), a novel RL algorithm that jointly optimizes absolute score regression and relative ranking order, improving both per-image accuracy and cross-image preference judgments. Aes-R1 enables MLLMs to generate grounded explanations alongside faithful scores, thereby enhancing aesthetic scoring and reasoning in a unified framework. Extensive experiments demonstrate that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8%, surpassing state-of-the-art baselines of similar size. Further ablation studies validate Aes-R1's robust generalization under limited supervision and in out-of-distribution scenarios.
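To make the "absolute plus relative" idea concrete, here is a minimal sketch of what jointly optimizing score regression and ranking order could look like. The function names, the pairwise hinge formulation, and the weighting `lam` are illustrative assumptions for exposition, not the paper's actual RAPO objective or reward design.

```python
# Illustrative sketch of a joint absolute + relative objective.
# All names (absolute_loss, relative_loss, rapo_objective, lam, margin)
# are hypothetical; the paper's actual RAPO formulation may differ.

def absolute_loss(pred, target):
    """Absolute term: mean squared error against ground-truth scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def relative_loss(pred, target, margin=0.1):
    """Relative term: pairwise hinge loss penalizing pairs whose
    predicted order contradicts the ground-truth order."""
    loss, pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:  # image i should outrank image j
                loss += max(0.0, margin - (pred[i] - pred[j]))
                pairs += 1
    return loss / max(pairs, 1)

def rapo_objective(pred, target, lam=0.5):
    """Weighted sum of the absolute and relative terms."""
    return absolute_loss(pred, target) + lam * relative_loss(pred, target)

# Accurate, correctly ordered predictions incur zero loss; swapping a
# pair raises the relative term even when the regression error is small.
good = rapo_objective([0.2, 0.5, 0.9], [0.2, 0.5, 0.9])
swapped = rapo_objective([0.5, 0.2, 0.9], [0.2, 0.5, 0.9])
```

The design intent mirrored here is that the absolute term anchors per-image accuracy (driving PLCC) while the relative term enforces cross-image preference consistency (driving SRCC), so neither objective dominates.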