AI Summary
This work addresses the challenge that existing large vision-language models (VLMs) struggle to perform expert-level appreciation of specialized art forms such as traditional Chinese painting. To bridge this gap, we propose HanMoVLM, the first VLM specifically tailored for Chinese painting evaluation. Our approach leverages a newly constructed HanMo-Bench dataset and expert-validated chain-of-thought (CoT) reasoning to enable a comprehensive inference pipeline, spanning content recognition, region-of-interest localization, thematic interpretation, and three-tiered professional assessment. Furthermore, we integrate reward-function optimization and test-time scaling mechanisms to enhance reasoning quality. Experimental results demonstrate that HanMoVLM achieves high alignment with human experts in professional evaluations and can serve as a high-quality verifier to significantly improve the artistic output of generative models.
Abstract
While large Vision-Language Models (VLMs) demonstrate impressive general visual capabilities, they remain artistically blind: unlike human experts, they cannot offer professional evaluation of artworks within specific artistic domains. To bridge this gap, we transform VLMs into experts capable of professional-grade painting evaluation in the Chinese artistic domain, which is more abstract and demands extensive artistic training to evaluate. We introduce HanMo-Bench, a new dataset that features authentic auction-grade masterpieces and AI-generated works, grounded in real-world market valuations. To achieve rigorous judgment, we propose HanMoVLM and construct a Chain-of-Thought (CoT) validated by experts. This CoT guides the model through expert-level reasoning: from content identification and Region of Interest (RoI) localization to professional evaluation, guided by both theme-specific criteria and the three-tier evaluation typical of Chinese painting. Furthermore, we design a reward function that refines HanMoVLM's reasoning process and improves its accuracy. We also demonstrate that HanMoVLM can serve as a critical backbone for test-time scaling in image generation: acting as a high-quality verifier, it enables generative models to select the most artistically superior output from multiple candidates. Experimental results and human studies confirm that HanMoVLM effectively bridges the gap, achieving high consistency with professional experts and significantly improving the quality of Chinese painting generation.
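The verifier-based selection described in the abstract resembles generic best-of-N sampling: generate several candidates, score each with a verifier, and keep the top one. The sketch below illustrates only that general pattern. The function `best_of_n`, the candidate names, and the toy scores are hypothetical placeholders; they are not HanMoVLM's interface or results, and the abstract does not specify how the scores are computed.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate with the highest verifier score.

    `score` stands in for any verifier (assumed here to map a candidate
    to a single float, higher meaning better). Ties go to the earliest
    candidate, which is how Python's built-in max() behaves.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Toy stand-in for verifier outputs; these numbers are made up for
# illustration and do not come from the paper.
toy_scores = {"draft_a": 0.42, "draft_b": 0.87, "draft_c": 0.65}

best = best_of_n(list(toy_scores), toy_scores.__getitem__)
print(best)
```

In a real pipeline the candidates would be generated images and `score` would wrap a call to the verifier model; the selection logic itself stays the same.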