Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment

📅 2025-12-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
AIGC art image aesthetic assessment faces challenges in jointly modeling visual perception, cognition, and emotion; suffers from scarce, imbalanced, multi-dimensional annotations; and struggles with long textual descriptions. Method: We introduce RAD, the first 70K-scale, multi-dimensional, structured aesthetic description dataset, and propose ArtQuantβ€”a large language model (LLM)-driven framework. Contribution/Results: (1) We design an iterative lightweight annotation pipeline to generate semantically rich, hierarchical aesthetic descriptions; (2) We replace multi-branch encoders and contrastive learning with unified descriptive generation for cross-dimensional collaborative modeling; (3) We theoretically prove this paradigm minimizes prediction entropy, enhancing alignment with human aesthetic judgments. Experiments show ArtQuant achieves state-of-the-art performance across multiple benchmarks, triples training efficiency over conventional methods, and significantly narrows the cognitive gap between model outputs and human aesthetic evaluations.

πŸ“ Abstract
The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing datasets focus heavily on visual perception and neglect deeper dimensions because manual annotation is expensive; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoders, while multimodal methods built on contrastive learning struggle to process long-form textual descriptions effectively. To resolve challenge (1), we present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset generated via an iterative pipeline that avoids heavy annotation costs and scales easily. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images that not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. Moreover, theoretical analysis confirms this symbiosis: RAD's semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.
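The entropy-minimization claim can be illustrated with a toy sketch (not the paper's actual analysis or model): if conditioning a score predictor on richer aesthetic descriptions concentrates its predictive distribution over score bins, its Shannon entropy drops. The two distributions below are invented for illustration only.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical predictive distributions over 5 aesthetic score bins.
# Image-only conditioning: near-uniform, i.e. the model is uncertain.
p_image_only = [0.22, 0.20, 0.20, 0.19, 0.19]
# With rich hierarchical descriptions: probability mass concentrates.
p_with_descriptions = [0.02, 0.06, 0.10, 0.70, 0.12]

h_before = shannon_entropy(p_image_only)
h_after = shannon_entropy(p_with_descriptions)
print(f"entropy, image only:        {h_before:.3f} bits")
print(f"entropy, with descriptions: {h_after:.3f} bits")
assert h_after < h_before  # richer conditioning lowers prediction entropy
```

Under this framing, lower prediction entropy corresponds to the model committing more confidently to a score, which is how the abstract's "semantic adequacy" argument connects data richness to alignment with human judgments.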
Problem

Research questions and friction points this paper is trying to address.

Addresses data scarcity in aesthetic quality assessment
Resolves model fragmentation for processing aesthetic attributes
Narrows cognitive gap between images and aesthetic judgment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical description learning for artistic image aesthetics assessment
Refined Aesthetic Description dataset generated via iterative pipeline
ArtQuant framework couples aesthetic dimensions with LLM decoders