Enhancing Image Generation Fidelity via Progressive Prompts

📅 2025-01-13

📈 Citations: 0

✨ Influential: 0

career value

183K/year

🤖 AI Summary

To address the coarse-grained prompt control and weak local controllability of DiT models in region-level image generation, this paper proposes a progressive coarse-to-fine prompting paradigm. First, an LLM parses the input prompt to generate multi-level semantic descriptions. Then, leveraging a newly discovered semantic division of labor across DiT’s cross-attention layers—where shallow layers govern spatial localization and deep layers encode high-level semantics—we design a region-aware, hierarchical prompt injection mechanism that decouples content and style control. The method establishes an end-to-end hierarchical control pipeline integrating DiT, LLM-based prompt parsing, and layered cross-attention. Experiments demonstrate consistent improvements across multiple benchmarks: FID decreases by 12.3%, CLIP-Score increases by 9.7%, and human evaluation shows significant gains in object localization accuracy, detail fidelity, and style consistency over state-of-the-art methods.

Technology Category

Application Category

📝 Abstract

The diffusion transformer (DiT) architecture has attracted significant attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT - based image generation methods focus on global - aware synthesis, and regional prompt control has been less explored. In this paper, we propose a coarse - to - fine generation pipeline for regional prompt - following generation. Specifically, we first utilize the powerful large language model (LLM) to generate both high - level descriptions of the image (such as content, topic, and objects) and low - level descriptions (such as details and style). Then, we explore the influence of cross - attention layers at different depths. We find that deeper layers are always responsible for high - level content control, while shallow layers handle low - level content control. Various prompts are injected into the proposed regional cross - attention control for coarse - to - fine generation. By using the proposed pipeline, we enhance the controllability of DiT - based image generation. Extensive quantitative and qualitative results show that our pipeline can improve the performance of the generated images.

Problem

Research questions and friction points this paper is trying to address.

Image Generation

Fine-grained Control

Region-specific

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer

Gradual Control

Image Quality Enhancement

🔎 Similar Papers

Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining