TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment

📅 2025-12-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Diffusion models commonly suffer from text omission in text-to-image generation, with especially poor rendering completeness for Chinese text. To address this, we propose a training-free attention-alignment guidance mechanism. First, we uncover how text-related tokens are distributed within the self-attention layers of MM-DiT. Second, we design a dual-loss latent-space guidance scheme for the early denoising stages that explicitly models correspondences between textual tokens and text regions in the image, achieving end-to-end, zero-training rendering correction. Our method integrates OCR-driven evaluation with a text–image region alignment loss. Experiments demonstrate state-of-the-art performance in text recall, OCR accuracy, and CLIP score, significantly improving both the completeness and the fidelity of rendered text.

πŸ“ Abstract
Despite recent advances, diffusion-based text-to-image models still struggle with accurate text rendering. Several studies have proposed fine-tuning or training-free refinement methods for accurate text rendering. However, the critical issue of text omission, where the desired text is partially or entirely missing, remains largely overlooked. In this work, we propose TextGuider, a novel training-free method that encourages accurate and complete text appearance by aligning textual content tokens and text regions in the image. Specifically, we analyze attention patterns in MM-DiT models, particularly for text-related tokens intended to be rendered in the image. Leveraging this observation, we apply latent guidance during the early stage of denoising steps based on two loss functions that we introduce. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.
Problem

Research questions and friction points this paper is trying to address.

Addresses inaccurate text rendering in diffusion models
Focuses on solving text omission issues in generated images
Proposes training-free guidance for complete text appearance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free guidance via attention alignment
Aligning textual tokens with image regions
Early-stage latent guidance using novel loss functions
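The guidance idea summarized above can be sketched as a gradient step on the latent, driven by losses over a patch-to-token attention map. This is a minimal illustration under assumptions, not the paper's implementation: the exact loss forms, the `attn_fn` interface, and all names here are hypothetical stand-ins for the two losses the authors introduce.

```python
import torch

def guidance_loss(attn, text_idx, region_mask):
    """Two illustrative losses (a sketch; forms are assumptions):
    - presence: every text token should receive some attention mass
      somewhere in the image (counters text omission).
    - alignment: the attention that text tokens do receive should
      fall inside the intended text region.
    attn:        [P, T] attention from P image patches to T prompt tokens
    text_idx:    indices of prompt tokens meant to be rendered as glyphs
    region_mask: [P] binary/soft mask of the intended text region
    """
    mass = attn[:, text_idx].sum(dim=0)            # total mass per text token
    presence = -(mass + 1e-8).log().mean()         # penalize omitted tokens
    per_patch = attn[:, text_idx].sum(dim=1)       # text-token mass per patch
    alignment = (per_patch * (1 - region_mask)).sum() / (per_patch.sum() + 1e-8)
    return presence + alignment

def guided_step(latent, attn_fn, text_idx, region_mask, step_size=0.05):
    """One training-free guidance update: nudge the latent down the
    gradient of the guidance loss (applied only in early denoising steps)."""
    latent = latent.detach().requires_grad_(True)
    loss = guidance_loss(attn_fn(latent), text_idx, region_mask)
    loss.backward()
    return (latent - step_size * latent.grad).detach()
```

In a real pipeline, `attn_fn` would expose the MM-DiT attention maps for the current latent; here any differentiable map from latent to a [P, T] attention matrix suffices to exercise the update.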
Kanghyun Baek
Interdisciplinary Program in Artificial Intelligence, Seoul National University
Sangyub Lee
Interdisciplinary Program in Artificial Intelligence, Seoul National University
Jin Young Choi
Interdisciplinary Program in Artificial Intelligence, Seoul National University
Jaewoo Song
Department of Electrical and Computer Engineering, Seoul National University; Global Technology Research, Samsung Electronics
Daemin Park
Department of Electrical and Computer Engineering, Seoul National University
Jooyoung Choi
Seoul National University
Deep Generative Models
Chaehun Shin
Seoul National University
deep learning, generative model
Bohyung Han
Professor, Electrical and Computer Engineering, Seoul National University
Computer vision, machine learning, deep learning
Sungroh Yoon
Professor, Electrical and Computer Engineering & Artificial Intelligence, Seoul National University
AI, deep learning, machine learning, on-device AI, bioinformatics