RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image (T2I) models struggle to accurately infer user intent from brief, ambiguous prompts, leading to semantic misalignment and poor compositional structure. To address this, we propose the first reasoning-augmented prompt rewriting framework, wherein a large language model (LLM) performs explicit semantic and compositional reasoning—guided by reinforcement learning—without relying on handcrafted rules or stylistic paraphrasing. Our method employs image-level, multi-dimensional rewards—namely human preference, semantic alignment, and visual composition—as supervisory signals, enabling end-to-end, annotation-free training of the prompt-rewriting model. Evaluated on GenEval and T2I-Compbench, our approach significantly improves spatial layout fidelity and compositional generalization, consistently outperforming prior methods and establishing new state-of-the-art performance.

📝 Abstract
Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language models, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. Tailored reward models assess the generated images in terms of human preference, semantic alignment, and visual composition, providing indirect supervision to refine prompt generation. Our approach enables end-to-end training without human-annotated data. Experiments on GenEval and T2I-Compbench show that RePrompt significantly boosts spatial layout fidelity and compositional generalization across diverse T2I backbones, establishing new state-of-the-art results.
Problem

Research questions and friction points this paper is trying to address.

Improving text-to-image generation from vague prompts
Enhancing prompts with visual semantics via reasoning
Optimizing image outcomes with reinforcement learning rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for prompt enhancement
Structured self-reflective prompt generation
Image-level outcome optimization via rewards
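The reward design described above can be sketched in miniature. The snippet below is a hedged illustration, not the paper's implementation: the weighting scheme, score values, and the best-of-N selection step are assumptions standing in for the actual reward models and RL update (which would reinforce the log-probability of high-reward rewrites rather than simply pick a winner).

```python
# Hedged sketch of multi-dimensional reward aggregation for prompt rewriting.
# The three scores mirror the paper's reward dimensions (human preference,
# semantic alignment, visual composition); equal weights are an assumption.

def combined_reward(human_pref, semantic_align, visual_comp,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three image-level reward signals."""
    scores = (human_pref, semantic_align, visual_comp)
    return sum(w * s for w, s in zip(weights, scores))


def pick_best_rewrite(candidates):
    """Toy stand-in for the RL step: among candidate rewritten prompts,
    each paired with the scores of its generated image, keep the one with
    the highest combined reward."""
    return max(candidates, key=lambda c: combined_reward(*c[1]))


# Each candidate: (rewritten prompt, (human_pref, semantic_align, visual_comp))
candidates = [
    ("a cat, photorealistic", (0.4, 0.5, 0.3)),
    ("a ginger cat sitting left of a blue vase, soft daylight",
     (0.7, 0.9, 0.8)),
]
best_prompt, best_scores = pick_best_rewrite(candidates)
```

Because the reward is computed on the generated image rather than on the prompt text itself, this supervision is indirect: no human-annotated prompt rewrites are needed, only reward models that score images.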
👥 Authors
Mingrui Wu (XMU)
Lu Wang (Microsoft)
Pu Zhao (Microsoft)
Fangkai Yang (Microsoft)
Jianjin Zhang (Microsoft)
Jianfeng Liu (Microsoft)
Yuefeng Zhan (Microsoft)
Weihao Han (Microsoft)
Hao Sun (Microsoft)
Jiayi Ji (Rutgers University)
Xiaoshuai Sun (Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China)
Qingwei Lin (Microsoft)
Weiwei Deng (Professor of Mechanical Engineering, Southern University of Science and Technology)
Dongmei Zhang (Microsoft Research)
Feng Sun (Unknown affiliation)
Qi Zhang (Microsoft)
Rongrong Ji (Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China)