🤖 AI Summary
Generative text-to-image models suffer from poor prompt adherence because large-scale training data is noisy and structurally inconsistent. Method: This paper improves controllability and text–image alignment through structured image descriptions that follow a unified four-element template (subject, setting, aesthetics, camera). The authors construct a high-quality dataset of 19 million text–image pairs, with structured captions generated by a LLaVA-Next model built on Mistral-7B-Instruct, and use it to fine-tune PixArt-Σ and Stable Diffusion 2. Alignment is quantitatively evaluated with a visual question answering (VQA) model. Contribution/Results: Structured descriptions yield a +4.2-point improvement in alignment scores over baseline models, outperform randomly shuffled versions of the same captions, and substantially reduce reliance on manual prompt engineering, pointing toward more controllable, semantically grounded image generation.
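The VQA-based alignment evaluation mentioned above can be sketched as follows. This is an illustrative protocol only: the summary does not specify how questions are derived from captions or which VQA backbone is used, and `vqa_yes_probability` is a hypothetical stand-in for any model returning P("yes") for an (image, question) pair.

```python
from typing import Callable, List


def alignment_score(image,
                    questions: List[str],
                    vqa_yes_probability: Callable) -> float:
    """Average the VQA model's 'yes' probability over caption-derived questions.

    A higher average means the generated image answers more of the
    caption's factual questions affirmatively, i.e. better alignment.
    """
    if not questions:
        return 0.0
    return sum(vqa_yes_probability(image, q) for q in questions) / len(questions)


# Usage with a dummy model that always answers with probability 0.8:
score = alignment_score("generated.png",
                        ["Is there a fox?", "Is the scene snowy?"],
                        lambda img, q: 0.8)
print(round(score, 2))  # 0.8
```

Averaging per-question probabilities (rather than hard yes/no counts) keeps the score continuous, which makes small alignment differences between fine-tuned models measurable.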
📝 Abstract
We argue that generative text-to-image models often struggle with prompt adherence due to the noisy and unstructured nature of large-scale datasets such as LAION-5B, which forces users to rely heavily on prompt engineering to elicit desirable outputs. In this work, we propose that enforcing a consistent caption structure during training can significantly improve model controllability and alignment. We introduce Re-LAION-Caption 19M, a high-quality subset of Re-LAION-5B comprising 19 million 1024×1024 images with captions generated by a Mistral-7B-Instruct-based LLaVA-Next model. Each caption follows a four-part template: subject, setting, aesthetics, and camera details. We fine-tune PixArt-Σ and Stable Diffusion 2 using both structured and randomly shuffled captions, and show that the structured versions consistently yield higher text-image alignment scores as measured by visual question answering (VQA) models. The dataset is publicly available at https://huggingface.co/datasets/supermodelresearch/Re-LAION-Caption19M.
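The four-part caption template described above can be sketched as a simple assembly step. The fixed element order (subject, setting, aesthetics, camera details) is from the abstract; the function name, separator, and example phrases are assumptions for illustration, not the paper's exact caption format.

```python
def build_structured_caption(subject: str, setting: str,
                             aesthetics: str, camera: str) -> str:
    """Assemble a caption from the four template elements, in fixed order.

    Keeping the order fixed is the point of the template: the model
    always sees the same information in the same position.
    """
    return " ".join([subject, setting, aesthetics, camera])


caption = build_structured_caption(
    subject="A red fox standing alert,",          # subject
    setting="in a snowy birch forest at dawn,",   # setting
    aesthetics="soft golden light, muted palette,",  # aesthetics
    camera="shot on an 85mm lens, shallow depth of field.",  # camera details
)
print(caption)
```

The randomly shuffled baseline in the paper's comparison would correspond to permuting these four elements before joining, which keeps the caption's content identical while destroying its structure.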