PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models frequently exhibit text–image misalignment when processing complex prompts involving attribute binding, negation, and compositional relationships. To address this, we propose a Chain-of-Thought (CoT) prompt rewriting framework that decouples the rewriting module from the generative model, enhancing intent alignment without modifying the original model’s weights. We introduce AlignEvaluator, a fine-grained reward model covering 24 critical semantic dimensions, systematically designed based on failure-mode analysis to guide reinforcement learning. Our method significantly improves text–image alignment on HunyuanImage 2.1. Furthermore, we release a high-quality human preference evaluation benchmark, empirically validating gains in both semantic precision and compositional generalization.

📝 Abstract
Recent advancements in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects like attribute binding, negation, and compositional relationships. This leads to a significant mismatch between user intent and the generated output. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pretrained T2I model without requiring modifications to its weights. Unlike prior methods that rely on model-specific fine-tuning or implicit reward signals like image-reward scores, our framework decouples the rewriter from the generator. We achieve this by training a Chain-of-Thought (CoT) rewriter through reinforcement learning, guided by a dedicated reward model we term the AlignEvaluator. The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy of 24 key points, which are derived from a comprehensive analysis of common T2I failure modes. By optimizing the CoT rewriter to maximize the reward from our AlignEvaluator, our framework learns to generate prompts that are more precisely interpreted by T2I models. Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges. Furthermore, we introduce a new, high-quality human preference benchmark to facilitate future research in this direction.
Problem

Research questions and friction points this paper is trying to address.

Improving text-to-image model fidelity to complex prompts
Addressing attribute binding and compositional relationship failures
Reducing mismatch between user intent and generated output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought prompt rewriting via reinforcement learning
Decoupled rewriter trained with dedicated AlignEvaluator reward
Systematic taxonomy of 24 failure modes for fine-grained feedback
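The reward-guided rewriting loop described above can be sketched as a toy policy-gradient setup. Everything here is an illustrative assumption, not the paper's implementation: `KEY_POINTS` stands in for the 24-point taxonomy (3 points for brevity), `align_evaluator` is a keyword check in place of the learned AlignEvaluator reward model, and the "rewriter" is a bandit-style policy over clarifying clauses rather than an LLM generating a Chain-of-Thought.

```python
import random

# Toy stand-in for the AlignEvaluator: scores a rewritten prompt against a
# checklist of semantic key points (24 in the paper; 3 here for brevity).
KEY_POINTS = ["color", "count", "position"]  # illustrative dimensions

def align_evaluator(rewritten: str) -> float:
    """Fraction of key points the rewritten prompt makes explicit."""
    hits = sum(1 for kp in KEY_POINTS if kp in rewritten)
    return hits / len(KEY_POINTS)

# Toy "rewriter" policy: independently decides which clarifying clauses to
# append. The real rewriter is an LLM producing a reasoning chain, then a prompt.
CLAUSES = {
    "color": "specify the color of each object",
    "count": "state the exact count of objects",
    "position": "describe the spatial position of each object",
}

def sample_rewrite(prompt: str, probs: dict) -> tuple:
    chosen = [kp for kp in KEY_POINTS if random.random() < probs[kp]]
    rewritten = prompt + "; " + "; ".join(CLAUSES[kp] for kp in chosen)
    return rewritten, chosen

def train(prompt: str, steps: int = 500, lr: float = 0.1) -> dict:
    """REINFORCE-style update: raise the probability of rewarded clauses."""
    probs = {kp: 0.5 for kp in KEY_POINTS}
    for _ in range(steps):
        rewritten, chosen = sample_rewrite(prompt, probs)
        reward = align_evaluator(rewritten)  # explicit, fine-grained feedback
        for kp in KEY_POINTS:
            taken = 1.0 if kp in chosen else 0.0
            # Nudge toward actions that beat a fixed 0.5 reward baseline.
            probs[kp] += lr * (reward - 0.5) * (taken - probs[kp])
            probs[kp] = min(max(probs[kp], 0.01), 0.99)
    return probs

random.seed(0)
learned = train("a red cube left of two blue spheres")
print({kp: round(v, 2) for kp, v in learned.items()})
```

After training, the policy learns to include all three clarifying clauses, since the evaluator rewards prompts that make every checked dimension explicit; the paper's framework plays out the same incentive structure with a learned reward model and a full LLM rewriter.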
👥 Authors
Linqing Wang (Tencent Hunyuan)
Ximing Xing (Tencent Hunyuan)
Yiji Cheng (Tsinghua University) · Computer Vision; Generative Models
Zhiyuan Zhao (Tencent Hunyuan)
Jiale Tao (Tencent; UESTC) · computer vision; image animation; video generation; semantic segmentation
Qixun Wang (Tencent Hunyuan)
Ruihuang Li (The Hong Kong Polytechnic University (PolyU)) · AIGC; image/video/3D generation/editing
Xin Li (Tencent Hunyuan)
Mingrui Wu (XMU) · MLLM; T2I
Xinchi Deng (Tencent Hunyuan)
Chunyu Wang (Tencent Hunyuan)
Qinglin Lu (Tencent Hunyuan)