🤖 AI Summary
Existing CLIP/BLIP-based human preference reward models suffer from an aesthetic bias: they over-penalize images with rich detail and high aesthetic quality, producing substantial misalignment with actual human preferences. To address this, we propose a dual-track optimization framework: (1) an Image-Contained-Text (ICT) scoring module that explicitly models cross-modal alignment between image and prompt; and (2) an image-only High-Preference (HP) scoring model trained exclusively on visual features to capture fine-grained fidelity and aesthetic quality. Trained jointly, these components move beyond conventional text-alignment-centric evaluation paradigms. Experiments demonstrate that the model improves human preference prediction accuracy by over 10% relative to prior methods, significantly enhancing aesthetic alignment in text-to-image generation, and it establishes a scalable, modality-aware evaluation foundation for higher-order visual aesthetic modeling.
📝 Abstract
Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment, yet existing evaluation frameworks have failed to evolve in parallel. This study reveals an inherent flaw in human preference reward models fine-tuned on CLIP and BLIP architectures: they inappropriately assign low scores to images with rich detail and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, the ICT (Image-Contained-Text) score, which meets and surpasses the objective of text-image alignment by assessing the degree to which an image represents its textual content. Building on this foundation, we further train an HP (High-Preference) score model using the image modality alone, enhancing image aesthetics and detail quality while preserving text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10% compared to existing methods and yields significant gains when used to optimize state-of-the-art text-to-image models. This work provides theoretical and empirical support for evolving image generation technology toward higher-order human aesthetic preferences. Code is available at https://github.com/BarretBa/ICTHP.
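The dual-score idea above can be illustrated with a toy sketch. All function names, inputs, and the multiplicative combination below are illustrative assumptions, not the paper's actual formulation: the point is only that an alignment score (ICT) gates an image-only preference score (HP), so a visually striking but off-prompt image cannot outrank a well-aligned, detailed one.

```python
def ict_score(image_text_sim: float) -> float:
    """Hypothetical ICT score: how fully the image represents the
    prompt's content, clipped to [0, 1]."""
    return max(0.0, min(1.0, image_text_sim))


def hp_score(aesthetic: float, detail: float) -> float:
    """Hypothetical HP score from image-only features (aesthetic
    quality and detail richness), averaged and clipped to [0, 1]."""
    return max(0.0, min(1.0, 0.5 * (aesthetic + detail)))


def preference_score(image_text_sim: float, aesthetic: float, detail: float) -> float:
    """Combined preference: alignment gates the image-only score
    (a multiplicative combination, assumed here for illustration)."""
    return ict_score(image_text_sim) * hp_score(aesthetic, detail)


# A well-aligned, detailed image (0.9 * 0.85 = 0.765) outranks a
# prettier but off-prompt one (0.4 * 1.0 = 0.4).
print(preference_score(0.9, 0.8, 0.9) > preference_score(0.4, 1.0, 1.0))  # True
```

The gating behavior is the design choice worth noting: with a purely additive combination, a high HP score could compensate for poor alignment, which is exactly the failure mode the ICT score is meant to rule out.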