🤖 AI Summary
Current text-to-image models suffer from textual distortion when generating images from long or multi-sentence prompts, primarily due to global attention dilution. To address this, we propose DCText, a training-free divide-and-conquer diffusion framework built upon multimodal diffusion Transformers that introduces no additional parameters or fine-tuning. Our approach features: (1) a dual attention masking mechanism, Text-Focus and Context-Expansion, that enables precise localization of critical text regions while jointly modeling contextual semantics; and (2) hierarchical prompt decomposition coupled with localized noise initialization, ensuring character-level alignment and global image coherence. Experiments demonstrate state-of-the-art text accuracy on both single-sentence and multi-sentence generation benchmarks, with preserved image fidelity and significantly reduced inference latency.
📝 Abstract
Despite recent text-to-image models achieving high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. To accurately render each segment within its region while preserving overall image coherence, we introduce two attention masks, Text-Focus and Context-Expansion, applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
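The divide-and-conquer idea behind the two masks can be illustrated with a minimal sketch. This is a conceptual toy, not the paper's implementation: the function name, the mask shapes, and the `(token_ids, region_ids)` input format are all assumptions; the abstract specifies only that a Text-Focus mask confines each text segment's attention to its assigned region, while a Context-Expansion mask then relaxes this to recover global coherence.

```python
import numpy as np

def build_attention_masks(n_image_tokens, segment_regions):
    """Sketch of the two cross-attention masks applied sequentially during denoising.

    segment_regions: list of (token_ids, region_ids) pairs, one per extracted
    text segment, where token_ids index text tokens and region_ids index the
    image tokens of that segment's assigned region (layout is hypothetical).
    Returns boolean masks of shape (n_text_tokens, n_image_tokens).
    """
    n_text = sum(len(token_ids) for token_ids, _ in segment_regions)

    # Text-Focus: each text segment may attend only to its own region's tokens.
    text_focus = np.zeros((n_text, n_image_tokens), dtype=bool)
    for token_ids, region_ids in segment_regions:
        for t in token_ids:
            text_focus[t, region_ids] = True

    # Context-Expansion: the restriction is lifted so segments also see the
    # full image context, preserving overall coherence.
    context_expansion = np.ones_like(text_focus)
    return text_focus, context_expansion
```

In a real Multi-Modal Diffusion Transformer these masks would modulate the joint text-image attention inside each block, with the Text-Focus phase running for the early denoising steps and Context-Expansion for the remainder; the exact schedule is described in the paper, not here.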