🤖 AI Summary
To address the coarse-grained prompt control and weak local controllability of DiT models in region-level image generation, this paper proposes a progressive coarse-to-fine prompting paradigm. First, an LLM parses the input prompt to generate multi-level semantic descriptions. Then, leveraging a newly discovered semantic division of labor across DiT’s cross-attention layers—where shallow layers govern spatial localization and deep layers encode high-level semantics—we design a region-aware, hierarchical prompt injection mechanism that decouples content and style control. The method establishes an end-to-end hierarchical control pipeline integrating DiT, LLM-based prompt parsing, and layered cross-attention. Experiments demonstrate consistent improvements across multiple benchmarks: FID decreases by 12.3%, CLIP-Score increases by 9.7%, and human evaluation shows significant gains in object localization accuracy, detail fidelity, and style consistency over state-of-the-art methods.
📝 Abstract
The diffusion transformer (DiT) architecture has attracted significant attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT - based image generation methods focus on global - aware synthesis, and regional prompt control has been less explored. In this paper, we propose a coarse - to - fine generation pipeline for regional prompt - following generation. Specifically, we first utilize the powerful large language model (LLM) to generate both high - level descriptions of the image (such as content, topic, and objects) and low - level descriptions (such as details and style). Then, we explore the influence of cross - attention layers at different depths. We find that deeper layers are always responsible for high - level content control, while shallow layers handle low - level content control. Various prompts are injected into the proposed regional cross - attention control for coarse - to - fine generation. By using the proposed pipeline, we enhance the controllability of DiT - based image generation. Extensive quantitative and qualitative results show that our pipeline can improve the performance of the generated images.