🤖 AI Summary
This work addresses the challenge of accurately rendering text in large-scale text-to-image diffusion models, particularly in scenarios involving multi-line layouts, dense textual content, and long-tail scripts such as Chinese. The authors propose a training-free, plug-and-play framework that leverages the intrinsic attention mechanisms of Diffusion Transformers (DiT) to spatially anchor and topologically refine text regions. Additionally, they introduce a Spectral-Modulated Glyph Injection (SGMI) strategy, which injects glyph priors via frequency-domain band-pass modulation to suppress semantic leakage and enhance character-structure fidelity. Evaluated on Qwen-Image, FLUX.1-dev, and SD3 across benchmarks including longText-Benchmark, CVTG, and CLT-Bench, the method significantly improves text readability while preserving semantic alignment and visual quality, with only minimal additional inference overhead.
📝 Abstract
Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose \textbf{FreeText}, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of \emph{Diffusion Transformer (DiT)} models. \textbf{FreeText} decomposes the problem into \emph{where to write} and \emph{what to write}. For \emph{where to write}, we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For \emph{what to write}, we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.
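To make the "frequency-domain band-pass modulation" idea in SGMI concrete, here is a minimal, hypothetical sketch of how a glyph prior could be amplified in a mid-frequency band of its 2-D Fourier spectrum (where stroke and edge structure tends to live) while leaving other bands untouched. The function name, cutoff values, and gain are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def bandpass_modulate(glyph, low=0.05, high=0.4, gain=1.5):
    """Illustrative band-pass gain in the 2-D Fourier domain.

    glyph: (H, W) float array, e.g. a rendered glyph prior.
    low/high: normalized radial-frequency cutoffs (assumed values).
    gain: amplification applied only inside the pass band; the
          rest of the spectrum passes through unchanged.
    """
    H, W = glyph.shape
    fy = np.fft.fftfreq(H)[:, None]           # vertical frequencies
    fx = np.fft.fftfreq(W)[None, :]           # horizontal frequencies
    radius = np.sqrt(fx**2 + fy**2)           # radial frequency map
    mask = np.where((radius >= low) & (radius <= high), gain, 1.0)
    spectrum = np.fft.fft2(glyph)
    return np.real(np.fft.ifft2(spectrum * mask))
```

With `gain=1.0` the operation is the identity (up to FFT round-off), which makes the modulation easy to sanity-check before tuning the band to emphasize glyph structure.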