🤖 AI Summary
Existing methods for aesthetic assessment of children's artwork are rarely multidimensional, interpretable, or educationally aligned. To address this, the paper introduces KidsArtBench, a benchmark of artwork by children aged 5–15. It comprises more than 1,000 artworks annotated by expert educators along nine rubric-aligned dimensions, together with expert qualitative comments, and supports both ordinal scoring and formative feedback. Methodologically, we propose an attribute-specific multi-LoRA architecture combined with Regression-Aware Fine-Tuning (RAFT), in which each attribute corresponds to one evaluation dimension of the rubric, so that abstract aesthetic judgment is broken into separate, interpretable scores aligned with ordinal scales. On Qwen2.5-VL-7B, our method achieves a Spearman correlation of 0.653 (+0.185 over the baseline's 0.468), with the largest gains on perceptual dimensions and narrowed gaps on higher-order aesthetic attributes. Data, code, and ethics documentation are publicly released.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable progress across many vision-language tasks; however, their capacity to evaluate artistic expression remains limited. Aesthetic concepts are inherently abstract and open-ended, and multimodal artwork annotations are scarce. We introduce KidsArtBench, a new benchmark of over 1k children's artworks (ages 5–15) annotated by 12 expert educators across 9 rubric-aligned dimensions, together with expert comments for feedback. Unlike prior aesthetic datasets that provide single scalar scores on adult imagery, KidsArtBench targets children's artwork and pairs multi-dimensional annotations with comment supervision to enable both ordinal assessment and formative feedback. Building on this resource, we propose an attribute-specific multi-LoRA approach, in which each attribute corresponds to a distinct evaluation dimension (e.g., Realism, Imagination) in the scoring rubric, combined with Regression-Aware Fine-Tuning (RAFT) to align predictions with ordinal scales. On Qwen2.5-VL-7B, our method increases correlation from 0.468 to 0.653, with the largest gains on perceptual dimensions and narrowed gaps on higher-order attributes. These results show that educator-aligned supervision and attribute-aware training yield pedagogically meaningful evaluations and establish a rigorous testbed for sustained progress in educational AI. We release data and code with ethics documentation.
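
The abstract does not spell out the RAFT objective, so the following is only a minimal sketch of one common reading of regression-aware fine-tuning: instead of training on the score token with cross-entropy alone, the model's distribution over candidate score tokens is turned into an expected score, which is then penalized with squared error against the educator's rating. The 1–5 scale, the logit values, the targets, and the function names are all illustrative assumptions, not taken from the paper; in an attribute-specific multi-LoRA setup, each attribute's logits would presumably come from that attribute's own adapter.

```python
import math

def expected_score(logits, scores):
    """Softmax over the candidate score tokens, then the probability-weighted mean."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return sum((e / total) * s for e, s in zip(exps, scores))

def raft_style_loss(logits, scores, target):
    """Squared error between the expected score and the gold rating."""
    return (expected_score(logits, scores) - target) ** 2

# Assumption: a 1-5 ordinal rubric scale (the abstract does not state the scale).
SCORES = [1, 2, 3, 4, 5]

# Hypothetical logits over the five score tokens, one list per attribute.
logits_by_attribute = {
    "Realism": [0.1, 0.3, 2.0, 0.5, 0.1],
    "Imagination": [0.0, 0.0, 0.0, 0.0, 0.0],
}
# Hypothetical educator ratings for a single artwork.
targets = {"Realism": 3, "Imagination": 4}

losses = {
    name: raft_style_loss(logits_by_attribute[name], SCORES, targets[name])
    for name in targets
}
```

Because the loss is computed on the expected score, a prediction that is off by one rubric level is penalized less than one that is off by three, which plain token-level cross-entropy does not distinguish.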