Text to Image Generation and Editing: A Survey

📅 2025-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This survey reviews 141 text-to-image (T2I) works published between 2021 and 2024. It organizes them around four foundation architectures (autoregressive, non-autoregressive, GAN-based, and diffusion models) and three commonly used key technologies (autoencoders, attention, and classifier-free guidance), and it also covers other directions such as energy-based models, Mamba, and multimodality. The methods are compared in two directions, T2I generation and T2I editing, and side by side in terms of datasets, evaluation metrics, training resources, and inference speed. The survey also discusses the potential social impact of T2I along with some proposed solutions, and closes with insights on improving T2I model performance and possible future directions.

📝 Abstract

Text-to-image generation (T2I) refers to the text-guided generation of high-quality images. In the past few years, T2I has attracted widespread attention and numerous works have emerged. In this survey, we comprehensively review 141 works published from 2021 to 2024. First, we introduce four foundation model architectures of T2I (autoregressive, non-autoregressive, GAN, and diffusion) and the commonly used key technologies (autoencoder, attention, and classifier-free guidance). Second, we systematically compare the methods of these studies in two directions, T2I generation and T2I editing, including the encoders and the key technologies they use. We also compare the performance of these studies side by side in terms of datasets, evaluation metrics, training resources, and inference speed. Beyond the four foundation models, we survey other works on T2I, such as energy-based models and the recent Mamba and multimodal approaches. We also investigate the potential social impact of T2I and provide some solutions. Finally, we offer insights into improving the performance of T2I models and possible future development directions. In summary, this survey is the first systematic and comprehensive overview of T2I, aiming to provide a valuable guide for future researchers and stimulate continued progress in this field.

Problem

Research questions and friction points this paper is trying to address.

Review text-to-image generation and editing methods comprehensively

Compare performance across datasets, metrics, and technologies

Explore future directions and social impacts of T2I

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reviewing four foundation model architectures for T2I

Comparing T2I generation and editing methods systematically

Surveying additional T2I works beyond foundation models
