🤖 AI Summary
Existing CLIP/BLIP-based human preference reward models suffer from an aesthetic bias: they over-penalize images with rich detail and high aesthetic quality, producing substantial misalignment with actual human preferences. To address this, we propose a dual-track optimization framework: (1) an Image-Contained-Text (ICT) scoring module that explicitly models cross-modal alignment between image and prompt; and (2) an image-only High-Preference (HP) scoring model trained exclusively on visual features to capture fine-grained fidelity and aesthetic quality. Trained jointly, these components move beyond conventional text-alignment-centric evaluation paradigms. Experiments demonstrate that the model improves human preference prediction accuracy by over 10% relative to prior methods, significantly enhancing aesthetic alignment in text-to-image generation, and it establishes a scalable, modality-aware evaluation foundation for higher-order visual aesthetic modeling.
📝 Abstract
Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment, yet existing evaluation frameworks have failed to evolve in parallel. This study reveals an inherent flaw in human preference reward models fine-tuned on CLIP and BLIP architectures: they inappropriately assign low scores to images with rich detail and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, the ICT (Image-Contained-Text) score, which meets and surpasses the objective of text-image alignment by assessing the degree to which an image represents its textual content. Building on this foundation, we further train an HP (High-Preference) score model using the image modality alone, enhancing image aesthetics and detail quality while preserving text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10% compared to existing methods and yields significant gains when used to optimize state-of-the-art text-to-image models. This work provides theoretical and empirical support for evolving image generation technology toward higher-order human aesthetic preferences. Code is available at https://github.com/BarretBa/ICTHP.
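The dual-score idea above can be illustrated with a toy sketch. All function names, inputs, and the multiplicative combination below are illustrative assumptions, not the paper's actual formulation: the point is only that an alignment score (ICT) gates an image-only preference score (HP), so a visually striking but off-prompt image cannot outrank a well-aligned, detailed one.

```python
def ict_score(image_text_sim: float) -> float:
    """Hypothetical ICT score: how fully the image represents the
    prompt's content, clipped to [0, 1]."""
    return max(0.0, min(1.0, image_text_sim))


def hp_score(aesthetic: float, detail: float) -> float:
    """Hypothetical HP score from image-only features (aesthetic
    quality and detail richness), averaged and clipped to [0, 1]."""
    return max(0.0, min(1.0, 0.5 * (aesthetic + detail)))


def preference_score(image_text_sim: float, aesthetic: float, detail: float) -> float:
    """Combined preference: alignment gates the image-only score
    (a multiplicative combination, assumed here for illustration)."""
    return ict_score(image_text_sim) * hp_score(aesthetic, detail)


# A well-aligned, detailed image (0.9 * 0.85 = 0.765) outranks a
# prettier but off-prompt one (0.4 * 1.0 = 0.4).
print(preference_score(0.9, 0.8, 0.9) > preference_score(0.4, 1.0, 1.0))  # True
```

The gating behavior is the design choice worth noting: with a purely additive combination, a high HP score could compensate for poor alignment, which is exactly the failure mode the ICT score is meant to rule out.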