DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image models suffer from textual distortion when generating images from long or multi-sentence prompts, primarily due to global attention dilution. To address this, we propose a training-free divide-and-conquer diffusion framework built on multimodal diffusion Transformers, requiring no additional parameters or fine-tuning. Our approach features: (1) a dual attention masking mechanism, Text-Focus and Context-Expansion, that enables precise localization of critical text regions while jointly modeling contextual semantics; and (2) hierarchical prompt decomposition coupled with local noise initialization, ensuring character-level alignment and global image coherence. Experiments demonstrate state-of-the-art text accuracy on both single-sentence and multi-sentence generation benchmarks, with preserved image fidelity and significantly reduced inference latency.
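The paper releases no code here, so the following is a minimal PyTorch sketch of how a two-phase additive attention mask (Text-Focus early in denoising, Context-Expansion afterwards) could be scheduled for a joint text/image token sequence in an MM-DiT block. The function names, the `focus_ratio` switch point, and the `(top, left, bottom, right)` box convention are illustrative assumptions, not the authors' implementation.

```python
import torch

def build_region_mask(h, w, box):
    """Boolean mask over an h x w latent grid; box = (top, left, bottom, right)."""
    m = torch.zeros(h, w, dtype=torch.bool)
    t, l, b, r = box
    m[t:b, l:r] = True
    return m.flatten()  # (h*w,) per-token region membership

def scheduled_attention_mask(step, total_steps, h, w, n_txt, region_box,
                             focus_ratio=0.5):
    """Early steps: Text-Focus (text tokens attend only to their region).
    Later steps: Context-Expansion (attention left fully open)."""
    n_img = h * w
    n = n_txt + n_img
    mask = torch.zeros(n, n)  # additive mask: 0 = attend, -inf = blocked
    if step < focus_ratio * total_steps:
        off_region = ~build_region_mask(h, w, region_box)
        # Block text -> image attention to pixels outside the assigned region...
        mask[:n_txt, n_txt:][:, off_region] = float("-inf")
        # ...and image -> text attention from those pixels back to the text.
        mask[n_txt:, :n_txt][off_region, :] = float("-inf")
    return mask

# Example: 32x32 latent, 16 text tokens, text region in the top-left quadrant.
m_early = scheduled_attention_mask(step=5,  total_steps=50, h=32, w=32,
                                   n_txt=16, region_box=(0, 0, 16, 16))
m_late  = scheduled_attention_mask(step=40, total_steps=50, h=32, w=32,
                                   n_txt=16, region_box=(0, 0, 16, 16))
```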

📝 Abstract
Despite recent text-to-image models achieving high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. To accurately render each segment within its region while preserving overall image coherence, we introduce two attention masks - Text-Focus and Context-Expansion - applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
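To make the "decompose, then assign each segment to a designated region" step concrete, here is a toy stand-in. The `TextSegment` container, the `decompose_prompt` name, and the even horizontal banding are all hypothetical; the paper's decomposition chooses regions per segment rather than spacing them uniformly.

```python
from dataclasses import dataclass

@dataclass
class TextSegment:
    text: str                        # one extracted target string
    box: tuple[int, int, int, int]   # (top, left, bottom, right) in latent cells

def decompose_prompt(target_texts, h, w):
    """Assign each extracted text segment its own horizontal band of the
    h x w latent grid. Even spacing is purely illustrative."""
    band = h // max(len(target_texts), 1)
    return [TextSegment(text=t, box=(i * band, 0, (i + 1) * band, w))
            for i, t in enumerate(target_texts)]

segments = decompose_prompt(['"GRAND OPENING"', '"50% OFF TODAY"'], h=32, w=32)
```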
Problem

Research questions and friction points this paper is trying to address.

Addresses diluted global attention in text-to-image models for long or multiple texts.
Improves text accuracy and region alignment without increasing computational cost.
Enhances visual text generation by preserving image coherence and reducing generation latency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Divide-and-conquer strategy for text decomposition
Sequential attention masks for regional text rendering
Localized noise initialization for accuracy and alignment
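The paper does not spell out the localized noise procedure in this summary, so the sketch below shows one plausible reading: start from ordinary Gaussian noise and re-draw the noise inside each assigned text region from its own dedicated generator, at no extra inference cost. The `localized_noise_init` name and the per-region seeding scheme are assumptions.

```python
import torch

def localized_noise_init(c, h, w, region_boxes, seed=0):
    """Gaussian initial latent whose text regions each get independently
    seeded noise, so every segment starts denoising from a consistent
    per-region initialization."""
    latent = torch.randn(c, h, w, generator=torch.Generator().manual_seed(seed))
    for i, (t, l, b, r) in enumerate(region_boxes):
        g = torch.Generator().manual_seed(seed + 1 + i)  # one generator per region
        latent[:, t:b, l:r] = torch.randn(c, b - t, r - l, generator=g)
    return latent

# Example: two stacked text regions on a 16-channel, 32x32 latent.
z0 = localized_noise_init(c=16, h=32, w=32,
                          region_boxes=[(0, 0, 16, 32), (16, 0, 32, 32)])
```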
👥 Authors
Jaewoo Song
Department of Electrical and Computer Engineering, Seoul National University
Jooyoung Choi
Seoul National University
Deep Generative Models
Kanghyun Baek
IPAI, AIIS, ASRI, INMC, ISRC, Seoul National University
Sangyub Lee
IPAI, AIIS, ASRI, INMC, ISRC, Seoul National University
Daemin Park
Department of Electrical and Computer Engineering, Seoul National University
Sungroh Yoon
Professor, Electrical and Computer Engineering & Artificial Intelligence, Seoul National University
AI, deep learning, machine learning, on-device AI, bioinformatics