Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing CLIP/BLIP-based human preference reward models suffer from aesthetic bias—over-penalizing images with rich detail and high aesthetic quality—leading to substantial misalignment with actual human preferences. To address this, we propose a dual-track optimization framework: (1) an Image-Contained-Text (ICT) scoring module that explicitly models cross-modal alignment between image and prompt; and (2) a pure-image High-Preference (HP) scoring model trained exclusively on visual features to capture fine-grained fidelity and aesthetic quality. Jointly trained, these components transcend the limitations of conventional text-alignment–centric paradigms. Experiments demonstrate that our model improves human preference prediction accuracy by over 10% relative to prior methods, significantly enhancing aesthetic alignment in text-to-image generation. Moreover, it establishes a scalable, modality-aware evaluation foundation for higher-order visual aesthetic modeling.

📝 Abstract
Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned from CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, the ICT (Image-Contained-Text) score, which meets and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality to enhance image aesthetics and detail quality while maintaining text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10% compared to existing methods, and achieves significant results in optimizing state-of-the-art text-to-image models. This research provides theoretical and empirical support for evolving image generation technology toward higher-order human aesthetic preferences. Code is available at https://github.com/BarretBa/ICTHP.
Problem

Research questions and friction points this paper is trying to address.

Existing reward models misalign with human aesthetic preferences
Current frameworks fail to evaluate high-detail, aesthetic images accurately
Need improved metrics for text-image alignment and image quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ICT score for text-image alignment
Trains HP score model for image aesthetics
Improves scoring accuracy by over 10%
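The dual-track idea above can be sketched in a few lines. Everything here is a hypothetical illustration, not the authors' implementation: the function names (`ict_score`, `hp_score`, `preference_score`), the cosine-similarity formulation of the alignment track, the linear-plus-sigmoid head for the image-only track, and the blending weight `alpha` are all assumptions layered on top of the paper's description of an ICT alignment score combined with an image-only HP score.

```python
# Hypothetical sketch of a dual-track reward: an ICT-style cross-modal
# alignment score plus an image-only HP-style preference score.
# All names, formulas, and weights here are illustrative assumptions.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def ict_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Alignment track: how well the image 'contains' the text,
    sketched as cosine similarity mapped from [-1, 1] to [0, 1]."""
    return 0.5 * (cosine(image_emb, text_emb) + 1.0)


def hp_score(image_emb: np.ndarray, w: np.ndarray, b: float = 0.0) -> float:
    """Preference track: image-only aesthetic/fidelity score, sketched
    as a linear head with a sigmoid over visual features."""
    return float(1.0 / (1.0 + np.exp(-(image_emb @ w + b))))


def preference_score(image_emb: np.ndarray, text_emb: np.ndarray,
                     w: np.ndarray, alpha: float = 0.5) -> float:
    """Combined reward: weighted mix of the two tracks. The blending
    weight alpha is a hypothetical choice, not taken from the paper."""
    return alpha * ict_score(image_emb, text_emb) + (1.0 - alpha) * hp_score(image_emb, w)
```

In the paper the two tracks are jointly trained neural scorers over real image/text encoders; the sketch only shows how the two signals could be combined into a single reward at inference time.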
Ying Ba
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE
Tianyu Zhang
iN2X
Yalong Bai
iN2X
Wenyi Mo
Rutgers University
Deep Learning · Vision-Language Model · Generative Model
Tao Liang
iN2X
Bing Su
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE
Ji-Rong Wen
Gaoling School of Artificial Intelligence, Renmin University of China
Large Language Model · Web Search · Information Retrieval · Machine Learning