🤖 AI Summary
This study investigates how synthetic training caption design influences text-to-image model performance. Through controlled experiments, we quantitatively analyze the effects of caption density, length distribution, and quality on text-image alignment, visual aesthetic quality, and generation diversity—using CLIPScore, DINO-based diversity metrics, and human evaluation across BLIP-2, Qwen-VL, and FLUX-generated benchmark datasets. Our key contributions are threefold: first, we empirically demonstrate that synthetic caption distributions significantly amplify inherent model biases; second, we identify that randomly sampled caption lengths achieve an optimal trade-off among alignment, aesthetics, and diversity—outperforming high-density, high-quality captions in overall performance; third, we establish caption design as a highly controllable, critical data-level lever for steering model behavior. These findings underscore the importance of deliberate caption curation in training pipelines for text-to-image synthesis.
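The alignment metric named above, CLIPScore, can be sketched in a few lines once image and text embeddings are available. This is a hedged illustration of the standard CLIPScore formula (rescaled clipped cosine similarity with weight w = 2.5), not the paper's exact evaluation pipeline; the embeddings would normally come from a CLIP encoder.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray, w: float = 2.5) -> float:
    """CLIPScore-style alignment: w * max(cosine_similarity, 0).

    image_emb / text_emb: embedding vectors, e.g. from a CLIP image and
    text encoder. The weight w = 2.5 follows the common CLIPScore convention.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_emb / np.linalg.norm(text_emb)
    return w * max(float(img @ txt), 0.0)
```

Averaging this score over a benchmark of caption-image pairs gives a single text-alignment number per trained model, which is how such scores are typically compared across captioning strategies.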
📝 Abstract
Training data is at the core of any successful text-to-image model. The quality and descriptiveness of image captions are crucial to a model's performance. Given the noisiness and inconsistency of web-scraped datasets, recent work has shifted towards synthetic training captions. While this setup is generally believed to produce more capable models, the current literature offers little insight into its design choices. This study closes this gap by systematically investigating how different synthetic captioning strategies impact the downstream performance of text-to-image models. Our experiments demonstrate that dense, high-quality captions enhance text alignment but may introduce trade-offs in output aesthetics and diversity. Conversely, captions of randomized lengths yield balanced improvements across aesthetics and alignment without compromising sample diversity. We also demonstrate that varying caption distributions induces significant shifts in the output bias of a trained model. Our findings underscore the importance of caption design in achieving optimal model performance and provide practical insights for more effective training data strategies in text-to-image generation.
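The randomized-length strategy described above can be sketched as a simple data-preparation step: truncate each dense synthetic caption to a sentence count drawn at random. The uniform-over-sentence-count scheme below is an assumption for illustration; the study's exact length distribution is not specified in this abstract.

```python
import random

def sample_caption(dense_caption: str, rng: random.Random) -> str:
    """Truncate a dense caption to a uniformly sampled number of sentences.

    Illustrative sketch: splitting on '.' and sampling the sentence count
    uniformly are assumptions, not the paper's documented procedure.
    """
    sentences = [s.strip() for s in dense_caption.split(".") if s.strip()]
    k = rng.randint(1, len(sentences))  # uniformly sampled target length
    return ". ".join(sentences[:k]) + "."
```

Applying such a sampler per training example yields a caption-length distribution spread between terse and dense, which is the property the abstract credits with balancing alignment, aesthetics, and diversity.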