🤖 AI Summary
This work addresses the challenge of accurately rendering text in large-scale text-to-image diffusion models, particularly in scenarios involving multi-line layouts, dense textual content, and long-tail scripts such as Chinese. The authors propose a training-free, plug-and-play framework that leverages the intrinsic attention mechanisms of Diffusion Transformers (DiT) to spatially anchor and topologically refine text regions. Additionally, they introduce a Spectral-Modulated Glyph Injection (SGMI) strategy, which injects glyph priors via frequency-domain band-pass modulation to suppress semantic leakage and enhance character-structure fidelity. Evaluated on Qwen-Image, FLUX.1-dev, and SD3 across benchmarks including longText-Benchmark, CVTG, and CLT-Bench, the method significantly improves text readability while preserving semantic alignment and visual quality, with only minimal additional inference overhead.
📝 Abstract
Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose \textbf{FreeText}, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of \emph{Diffusion Transformer (DiT)} models. \textbf{FreeText} decomposes the problem into \emph{where to write} and \emph{what to write}. For \emph{where to write}, we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For \emph{what to write}, we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.
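To make the "frequency-domain band-pass modulation" idea in SGMI concrete, here is a minimal, hypothetical sketch of how a glyph prior could be amplified in a mid-frequency band of its 2-D Fourier spectrum (where stroke and edge structure tends to live) while leaving other bands untouched. The function name, cutoff values, and gain are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def bandpass_modulate(glyph, low=0.05, high=0.4, gain=1.5):
    """Illustrative band-pass gain in the 2-D Fourier domain.

    glyph: (H, W) float array, e.g. a rendered glyph prior.
    low/high: normalized radial-frequency cutoffs (assumed values).
    gain: amplification applied only inside the pass band; the
          rest of the spectrum passes through unchanged.
    """
    H, W = glyph.shape
    fy = np.fft.fftfreq(H)[:, None]           # vertical frequencies
    fx = np.fft.fftfreq(W)[None, :]           # horizontal frequencies
    radius = np.sqrt(fx**2 + fy**2)           # radial frequency map
    mask = np.where((radius >= low) & (radius <= high), gain, 1.0)
    spectrum = np.fft.fft2(glyph)
    return np.real(np.fft.ifft2(spectrum * mask))
```

With `gain=1.0` the operation is the identity (up to FFT round-off), which makes the modulation easy to sanity-check before tuning the band to emphasize glyph structure.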