🤖 AI Summary
Current text-to-image models suffer from textual distortion when generating images from long or multi-sentence prompts, primarily due to global attention dilution. To address this, we propose DCText, a training-free divide-and-conquer diffusion framework built upon multimodal diffusion Transformers that introduces no additional parameters or fine-tuning. Our approach features: (1) a dual attention masking mechanism, Text-Focus and Context-Expansion, that enables precise localization of critical text regions while jointly modeling contextual semantics; and (2) hierarchical prompt decomposition coupled with localized noise initialization, ensuring character-level alignment and global image coherence. Experiments demonstrate state-of-the-art text accuracy on both single-sentence and multi-sentence generation benchmarks, with preserved image fidelity and significantly reduced inference latency.
📝 Abstract
Despite recent text-to-image models achieving high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. To accurately render each segment within its region while preserving overall image coherence, we introduce two attention masks, Text-Focus and Context-Expansion, applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
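The divide-and-conquer idea behind the two masks can be illustrated with a minimal sketch. This is a conceptual toy, not the paper's implementation: the function name, the mask shapes, and the `(token_ids, region_ids)` input format are all assumptions; the abstract specifies only that a Text-Focus mask confines each text segment's attention to its assigned region, while a Context-Expansion mask then relaxes this to recover global coherence.

```python
import numpy as np

def build_attention_masks(n_image_tokens, segment_regions):
    """Sketch of the two cross-attention masks applied sequentially during denoising.

    segment_regions: list of (token_ids, region_ids) pairs, one per extracted
    text segment, where token_ids index text tokens and region_ids index the
    image tokens of that segment's assigned region (layout is hypothetical).
    Returns boolean masks of shape (n_text_tokens, n_image_tokens).
    """
    n_text = sum(len(token_ids) for token_ids, _ in segment_regions)

    # Text-Focus: each text segment may attend only to its own region's tokens.
    text_focus = np.zeros((n_text, n_image_tokens), dtype=bool)
    for token_ids, region_ids in segment_regions:
        for t in token_ids:
            text_focus[t, region_ids] = True

    # Context-Expansion: the restriction is lifted so segments also see the
    # full image context, preserving overall coherence.
    context_expansion = np.ones_like(text_focus)
    return text_focus, context_expansion
```

In a real Multi-Modal Diffusion Transformer these masks would modulate the joint text-image attention inside each block, with the Text-Focus phase running for the early denoising steps and Context-Expansion for the remainder; the exact schedule is described in the paper, not here.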