🤖 AI Summary
Generative text-to-image models suffer from poor prompt adherence because large-scale training data is noisy and structurally inconsistent. Method: This paper improves controllability and text–image alignment through structured image descriptions that follow a unified four-element template (subject, setting, aesthetics, camera). The authors construct a high-quality dataset of 19 million text–image pairs, with structured captions generated by a LLaVA-Next model built on Mistral-7B-Instruct, and use it to fine-tune PixArt-Σ and Stable Diffusion 2. Alignment is quantitatively evaluated with a visual question answering (VQA) model. Contribution/Results: Structured descriptions yield a +4.2-point improvement in alignment scores over baseline models, outperform randomly shuffled versions of the same captions, and substantially reduce reliance on manual prompt engineering, pointing toward more controllable, semantically grounded image generation.
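The VQA-based alignment evaluation mentioned above can be sketched as follows. This is an illustrative protocol only: the summary does not specify how questions are derived from captions or which VQA backbone is used, and `vqa_yes_probability` is a hypothetical stand-in for any model returning P("yes") for an (image, question) pair.

```python
from typing import Callable, List


def alignment_score(image,
                    questions: List[str],
                    vqa_yes_probability: Callable) -> float:
    """Average the VQA model's 'yes' probability over caption-derived questions.

    A higher average means the generated image answers more of the
    caption's factual questions affirmatively, i.e. better alignment.
    """
    if not questions:
        return 0.0
    return sum(vqa_yes_probability(image, q) for q in questions) / len(questions)


# Usage with a dummy model that always answers with probability 0.8:
score = alignment_score("generated.png",
                        ["Is there a fox?", "Is the scene snowy?"],
                        lambda img, q: 0.8)
print(round(score, 2))  # 0.8
```

Averaging per-question probabilities (rather than hard yes/no counts) keeps the score continuous, which makes small alignment differences between fine-tuned models measurable.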
📝 Abstract
We argue that generative text-to-image models often struggle with prompt adherence due to the noisy and unstructured nature of large-scale datasets such as LAION-5B, which forces users to rely heavily on prompt engineering to elicit desirable outputs. In this work, we propose that enforcing a consistent caption structure during training can significantly improve model controllability and alignment. We introduce Re-LAION-Caption 19M, a high-quality subset of Re-LAION-5B comprising 19 million 1024×1024 images with captions generated by a Mistral-7B-Instruct-based LLaVA-Next model. Each caption follows a four-part template: subject, setting, aesthetics, and camera details. We fine-tune PixArt-Σ and Stable Diffusion 2 using both structured and randomly shuffled captions, and show that the structured versions consistently yield higher text-image alignment scores as measured by visual question answering (VQA) models. The dataset is publicly available at https://huggingface.co/datasets/supermodelresearch/Re-LAION-Caption19M.
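The four-part caption template described above can be sketched as a simple assembly step. The fixed element order (subject, setting, aesthetics, camera details) is from the abstract; the function name, separator, and example phrases are assumptions for illustration, not the paper's exact caption format.

```python
def build_structured_caption(subject: str, setting: str,
                             aesthetics: str, camera: str) -> str:
    """Assemble a caption from the four template elements, in fixed order.

    Keeping the order fixed is the point of the template: the model
    always sees the same information in the same position.
    """
    return " ".join([subject, setting, aesthetics, camera])


caption = build_structured_caption(
    subject="A red fox standing alert,",          # subject
    setting="in a snowy birch forest at dawn,",   # setting
    aesthetics="soft golden light, muted palette,",  # aesthetics
    camera="shot on an 85mm lens, shallow depth of field.",  # camera details
)
print(caption)
```

The randomly shuffled baseline in the paper's comparison would correspond to permuting these four elements before joining, which keeps the caption's content identical while destroying its structure.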