PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models frequently exhibit text–image misalignment when processing complex prompts involving attribute binding, negation, and compositional relationships. To address this, we propose a Chain-of-Thought (CoT) prompt rewriting framework that decouples the rewriting module from the generative model, enhancing intent alignment without modifying the original model’s weights. We introduce AlignEvaluator, a fine-grained reward model covering 24 critical semantic dimensions, systematically designed based on failure-mode analysis to guide reinforcement learning. Our method significantly improves text–image alignment on HunyuanImage 2.1. Furthermore, we release a high-quality human preference evaluation benchmark, empirically validating gains in both semantic precision and compositional generalization.

📝 Abstract
Recent advancements in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects like attribute binding, negation, and compositional relationships. This leads to a significant mismatch between user intent and the generated output. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pretrained T2I model without requiring modifications to its weights. Unlike prior methods that rely on model-specific fine-tuning or implicit reward signals like image-reward scores, our framework decouples the rewriter from the generator. We achieve this by training a Chain-of-Thought (CoT) rewriter through reinforcement learning, guided by a dedicated reward model we term the AlignEvaluator. The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy of 24 key points, which are derived from a comprehensive analysis of common T2I failure modes. By optimizing the CoT rewriter to maximize the reward from our AlignEvaluator, our framework learns to generate prompts that are more precisely interpreted by T2I models. Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges. Furthermore, we introduce a new, high-quality human preference benchmark to facilitate future research in this direction.
Problem

Research questions and friction points this paper is trying to address.

Improving text-to-image model fidelity to complex prompts
Addressing attribute binding and compositional relationship failures
Reducing mismatch between user intent and generated output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought prompt rewriting via reinforcement learning
Decoupled rewriter trained with dedicated AlignEvaluator reward
Systematic taxonomy of 24 failure modes for fine-grained feedback
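The reward-guided rewriting loop described above can be sketched as a toy policy-gradient setup. Everything here is an illustrative assumption, not the paper's implementation: `KEY_POINTS` stands in for the 24-point taxonomy (3 points for brevity), `align_evaluator` is a keyword check in place of the learned AlignEvaluator reward model, and the "rewriter" is a bandit-style policy over clarifying clauses rather than an LLM generating a Chain-of-Thought.

```python
import random

# Toy stand-in for the AlignEvaluator: scores a rewritten prompt against a
# checklist of semantic key points (24 in the paper; 3 here for brevity).
KEY_POINTS = ["color", "count", "position"]  # illustrative dimensions

def align_evaluator(rewritten: str) -> float:
    """Fraction of key points the rewritten prompt makes explicit."""
    hits = sum(1 for kp in KEY_POINTS if kp in rewritten)
    return hits / len(KEY_POINTS)

# Toy "rewriter" policy: independently decides which clarifying clauses to
# append. The real rewriter is an LLM producing a reasoning chain, then a prompt.
CLAUSES = {
    "color": "specify the color of each object",
    "count": "state the exact count of objects",
    "position": "describe the spatial position of each object",
}

def sample_rewrite(prompt: str, probs: dict) -> tuple:
    chosen = [kp for kp in KEY_POINTS if random.random() < probs[kp]]
    rewritten = prompt + "; " + "; ".join(CLAUSES[kp] for kp in chosen)
    return rewritten, chosen

def train(prompt: str, steps: int = 500, lr: float = 0.1) -> dict:
    """REINFORCE-style update: raise the probability of rewarded clauses."""
    probs = {kp: 0.5 for kp in KEY_POINTS}
    for _ in range(steps):
        rewritten, chosen = sample_rewrite(prompt, probs)
        reward = align_evaluator(rewritten)  # explicit, fine-grained feedback
        for kp in KEY_POINTS:
            taken = 1.0 if kp in chosen else 0.0
            # Nudge toward actions that beat a fixed 0.5 reward baseline.
            probs[kp] += lr * (reward - 0.5) * (taken - probs[kp])
            probs[kp] = min(max(probs[kp], 0.01), 0.99)
    return probs

random.seed(0)
learned = train("a red cube left of two blue spheres")
print({kp: round(v, 2) for kp, v in learned.items()})
```

After training, the policy learns to include all three clarifying clauses, since the evaluator rewards prompts that make every checked dimension explicit; the paper's framework plays out the same incentive structure with a learned reward model and a full LLM rewriter.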
👥 Authors
Linqing Wang (Tencent Hunyuan)
Ximing Xing (Tencent Hunyuan)
Yiji Cheng (Tsinghua University) · Computer Vision; Generative Models
Zhiyuan Zhao (Tencent Hunyuan)
Jiale Tao (Tencent; UESTC) · computer vision; image animation; video generation; semantic segmentation
Qixun Wang (Tencent Hunyuan)
Ruihuang Li (The Hong Kong Polytechnic University (PolyU)) · AIGC; image/video/3D generation/editing
Xin Li (Tencent Hunyuan)
Mingrui Wu (XMU) · MLLM; T2I
Xinchi Deng (Tencent Hunyuan)
Chunyu Wang (Tencent Hunyuan)
Qinglin Lu (Tencent Hunyuan)