🤖 AI Summary
It remains unclear whether existing vision-language models (VLMs) encode the multi-level aesthetic attributes needed for personalized image aesthetics assessment (PIAA). This work analyzes the internal representations of VLMs and finds that they encode diverse aesthetic attributes that propagate into the language decoder layers. Building on this finding, we propose a lightweight, fine-tuning-free approach to personalization: simple linear models trained on these internal representations predict aesthetic preferences at the level of individual users. We further analyze how aesthetic information is transferred across layers in different VLM architectures and across image domains, offering insight into how VLMs can be used to model subjective, individual aesthetic preferences.
📝 Abstract
Personalized image aesthetics assessment (PIAA) is an important research problem with practical applications. While methods based on vision-language models (VLMs) are promising candidates for PIAA, it remains unclear whether these models internally encode the rich, multi-level aesthetic attributes required for effective personalization. In this paper, we first analyze the internal representations of VLMs to examine the presence and distribution of such aesthetic attributes, and then leverage these representations for lightweight, individual-level personalization without model fine-tuning. Our analysis reveals that VLMs encode diverse aesthetic attributes that propagate into the language decoder layers. Building on these representations, we demonstrate that simple linear models can perform PIAA effectively. We further analyze how aesthetic information is transferred across layers in different VLM architectures and across image domains. Our findings provide insights into how VLMs can be utilized for modeling subjective, individual aesthetic preferences. Our code is available at https://github.com/ynklab/vlm-latent-piaa.
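To make the idea of a per-user linear model on frozen representations concrete, the following is a minimal sketch and not the released implementation. It assumes that each image has already been reduced to a fixed-length feature vector (for example, a pooled hidden state from one language decoder layer) and that one user has rated a small set of images. Here random vectors stand in for those features, ridge regression stands in for the "simple linear model" (the abstract does not specify the regularizer), and the names `fit_ridge`, `predict`, and `make_synthetic_user` are hypothetical.

```python
import numpy as np


def fit_ridge(features, ratings, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    features: (n_images, dim) array, assumed to be frozen VLM features.
    ratings:  (n_images,) array of one user's aesthetic scores.
    """
    x_mean = features.mean(axis=0)
    y_mean = ratings.mean()
    xc = features - x_mean  # center so the intercept is not regularized
    yc = ratings - y_mean
    dim = features.shape[1]
    weights = np.linalg.solve(xc.T @ xc + alpha * np.eye(dim), xc.T @ yc)
    bias = y_mean - x_mean @ weights
    return weights, bias


def predict(features, weights, bias):
    """Predict this user's scores for new images."""
    return features @ weights + bias


def make_synthetic_user(n_images=200, dim=64, noise=0.1, seed=0):
    """Synthetic stand-in data: random 'features' and a hidden linear preference."""
    rng = np.random.default_rng(seed)
    feats = rng.normal(size=(n_images, dim))
    true_w = rng.normal(size=dim)
    ratings = feats @ true_w + noise * rng.normal(size=n_images)
    return feats, ratings


if __name__ == "__main__":
    feats, ratings = make_synthetic_user()
    # Fit on the first 150 rated images, evaluate on the remaining 50.
    w, b = fit_ridge(feats[:150], ratings[:150], alpha=1.0)
    pred = predict(feats[150:], w, b)
    corr = np.corrcoef(pred, ratings[150:])[0, 1]
    print(f"held-out Pearson correlation: {corr:.3f}")
```

Because the VLM stays frozen in this setup, personalization reduces to solving one small linear system per user, which is what makes the approach lightweight. With real data, the synthetic arrays would be replaced by features extracted from the chosen decoder layer and by the user's actual ratings.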