Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training

📅 2023-12-23
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models still struggle to follow natural language instructions accurately, particularly those specifying spatial relationships among objects. To address this, we propose Iterative Prompt Relabeling (IPR): a method that identifies image-text mismatches in generated samples, then relabels the corresponding text prompts using cross-modal matching scores and classifier-based feedback. IPR integrates this iterative refinement directly into the diffusion training pipeline without requiring reinforcement learning (RL), which avoids the high variance and training instability often associated with RL-based methods. We validate IPR on Stable Diffusion v2 and SDXL, achieving up to a 15.22% absolute improvement on the spatial-relation benchmark VISOR and outperforming previous RL methods. The implementation is publicly available.
📝 Abstract
Diffusion models have shown impressive performance in many domains. However, their capability to follow natural language instructions (e.g., spatial relationships between objects, generating complex scenes) is still unsatisfactory. In this work, we propose Iterative Prompt Relabeling (IPR), a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback. IPR first samples a batch of images conditioned on the text, then relabels the text prompts of unmatched text-image pairs with classifier feedback. We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations. With IPR, we achieve up to a 15.22% absolute improvement on the challenging spatial relation VISOR benchmark, demonstrating superior performance compared to previous RL methods. Our code is publicly available at https://github.com/xinyan-cxy/IPR-RLDF.
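The sample-then-relabel step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_sample` and `toy_classify` are hypothetical stand-ins for the diffusion sampler and the classifier feedback, images are represented as plain dictionaries, and keeping matched pairs alongside relabeled ones is an assumption made here for illustration.

```python
import random

RELATIONS = ("to the left of", "to the right of", "above", "below")


def make_prompt(obj_a, relation, obj_b):
    """Build a simple two-object spatial prompt."""
    return f"a {obj_a} {relation} a {obj_b}"


def toy_sample(obj_a, relation, obj_b, rng, follow_rate=0.5):
    """Hypothetical sampler stub: 'renders' the requested relation only some of the time."""
    if rng.random() < follow_rate:
        shown = relation
    else:
        shown = rng.choice([r for r in RELATIONS if r != relation])
    return {"objects": (obj_a, obj_b), "relation": shown}


def toy_classify(image):
    """Hypothetical feedback stub: reports the relation visible in the image."""
    return image["relation"]


def relabel_round(specs, rng, samples_per_prompt=4):
    """Sample images per prompt and relabel the prompts of unmatched pairs."""
    pairs = []
    for obj_a, relation, obj_b in specs:
        for _ in range(samples_per_prompt):
            image = toy_sample(obj_a, relation, obj_b, rng)
            observed = toy_classify(image)
            matched = observed == relation
            # Matched pairs keep their prompt (an assumption of this sketch);
            # unmatched pairs get a prompt describing what the image actually shows.
            prompt = make_prompt(obj_a, relation if matched else observed, obj_b)
            pairs.append({"image": image, "prompt": prompt, "relabeled": not matched})
    return pairs


rng = random.Random(0)
specs = [("dog", "to the left of", "bench"), ("cup", "above", "book")]
for pair in relabel_round(specs, rng):
    print(pair["relabeled"], "|", pair["prompt"])
```

In an iterative setting, the returned image-prompt pairs would serve as fine-tuning data, and the round would be repeated with the updated model; the fine-tuning step itself is omitted here.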
Problem

Research questions and friction points this paper is trying to address.

Improves text-to-image alignment in diffusion models
Enhances spatial relationship understanding in generated images
Addresses unsatisfactory natural language instruction following
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative Prompt Relabeling (IPR) algorithm
Aligns images to text iteratively
Improves spatial relation accuracy by up to 15.22% (absolute) on the VISOR benchmark
Xinyan Chen
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; University of Science and Technology of China
Jiaxin Ge
UC Berkeley
Natural Language Processing, Computer Vision, Generative AI, Multi-Modality
Tianjun Zhang
University of California, Berkeley
Reinforcement Learning, Machine Learning, Artificial Intelligence
Jiaming Liu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Shanghang Zhang
Peking University
Embodied AI, Foundation Models