UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image generation still faces challenges in visual text rendering, including glyph blurriness, semantic drift, and weak stylistic control. Existing approaches rely on pre-rendered glyph images, compromising fidelity in font style and color, while multi-branch architectures incur high computational overhead and limit flexibility. This paper proposes a unified conditional diffusion framework grounded in pixel-level text segmentation masks. We introduce a bilingual text segmentation model, design a region-aware loss function, and incorporate an adaptive glyph-conditioning injection mechanism. Our method enables high-fidelity control over glyph identity, color, and spatial layout, significantly improving legibility of small-scale text and consistency in complex typographic arrangements. On the AnyText benchmark, our approach achieves state-of-the-art performance; it further demonstrates superior small-glyph rendering quality and structural preservation on two newly introduced benchmarks—GlyphMM and MiniText.
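The region-aware loss is described only at a high level above. A minimal sketch of one plausible realization, a mask-weighted denoising objective, is shown below; the function name, the `lambda_text` weight, and the epsilon-prediction setup are assumptions, not the paper's exact formulation.

```python
import torch

def region_aware_loss(eps_pred, eps_true, text_mask, lambda_text=2.0):
    """Mask-weighted denoising loss (illustrative sketch, not the paper's
    exact formulation).

    eps_pred, eps_true: (B, C, H, W) predicted / ground-truth noise
    text_mask:          (B, 1, H, W) binary mask, 1 inside text regions
    lambda_text:        extra weight on text pixels (hypothetical value)
    """
    per_pixel = (eps_pred - eps_true).pow(2)         # standard diffusion MSE
    weights = 1.0 + (lambda_text - 1.0) * text_mask  # 1 outside text, lambda inside
    return (weights * per_pixel).mean()
```

Upweighting only the masked pixels leaves the global objective intact while pushing the model to resolve fine glyph structure, which is consistent with the small-text improvements the summary reports.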

📝 Abstract
Text-to-image generation has greatly advanced content creation, yet accurately rendering visual text remains a key challenge due to blurred glyphs, semantic drift, and limited style control. Existing methods often rely on pre-rendered glyph images as conditions, but these struggle to retain original font styles and color cues, necessitating complex multi-branch designs that increase model overhead and reduce flexibility. To address these issues, we propose a segmentation-guided framework that uses pixel-level visual text masks -- rich in glyph shape, color, and spatial detail -- as unified conditional inputs. Our method introduces two core components: (1) a fine-tuned bilingual segmentation model for precise text mask extraction, and (2) a streamlined diffusion model augmented with adaptive glyph conditioning and a region-specific loss to preserve textual fidelity in both content and style. Our approach achieves state-of-the-art performance on the AnyText benchmark, significantly surpassing prior methods in both Chinese and English settings. To enable more rigorous evaluation, we also introduce two new benchmarks: GlyphMM-benchmark for testing layout and glyph consistency in complex typesetting, and MiniText-benchmark for assessing generation quality in small-scale text regions. Experimental results show that our model outperforms existing methods by a large margin in both scenarios, particularly excelling at small text rendering and complex layout preservation, validating its strong generalization and deployment readiness.
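The abstract notes that the mask condition is "rich in glyph shape, color, and spatial detail." A hedged sketch of how such a condition could be derived from a segmentation model follows; `seg_model` and `extract_glyph_condition` are hypothetical stand-ins, not the paper's API.

```python
import torch

@torch.no_grad()
def extract_glyph_condition(seg_model, image, threshold=0.5):
    """Binarize a text-segmentation model's output into a conditioning mask.

    seg_model: any bilingual text-segmentation network returning per-pixel
               text logits of shape (B, 1, H, W); a stand-in for the paper's
               fine-tuned model.
    image:     (B, 3, H, W) input image in [0, 1]
    """
    probs = torch.sigmoid(seg_model(image))  # per-pixel text probability
    mask = (probs > threshold).float()       # binary text mask
    # Masking the image rather than re-rendering glyphs preserves the
    # original font shapes and colors, the motivation stated in the abstract.
    return image * mask, mask
```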
Problem

Research questions and friction points this paper is trying to address.

Accurately render visual text with precise glyphs and styles
Overcome limitations of pre-rendered glyph conditions in text-to-image
Improve text fidelity in complex layouts and small-scale regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segmentation-guided framework with pixel-level text masks
Fine-tuned bilingual model for precise mask extraction
Streamlined diffusion model with adaptive glyph conditioning (see the sketch after this list)
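The summary gives no architectural details for the adaptive glyph-conditioning injection. A common design is to encode the pixel-level text mask down to latent resolution and fuse it with the noisy latent at the UNet input; the sketch below assumes this channel-concatenation scheme, with hypothetical module and parameter names, and the paper's actual mechanism may differ.

```python
import torch
import torch.nn as nn

class GlyphConditionedInput(nn.Module):
    """Encodes a pixel-level text mask to latent resolution and fuses it
    with the noisy latent before the denoising UNet. One common
    conditioning design; the paper's actual injection may differ."""

    def __init__(self, latent_ch=4, mask_ch=3, hidden=32):
        super().__init__()
        # Three stride-2 convs: e.g. a 512x512 mask -> 64x64 latent grid.
        self.mask_encoder = nn.Sequential(
            nn.Conv2d(mask_ch, hidden, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, latent_ch, 3, stride=2, padding=1),
        )
        self.fuse = nn.Conv2d(2 * latent_ch, latent_ch, 1)  # back to UNet width

    def forward(self, noisy_latent, text_mask):
        cond = self.mask_encoder(text_mask)                   # (B, 4, h, w)
        return self.fuse(torch.cat([noisy_latent, cond], 1))  # fused UNet input
```

Because the mask is injected once at the input rather than through a parallel branch, this style of design avoids the multi-branch overhead the paper criticizes in prior work.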
👥 Authors
Yuanrui Wang
Tsinghua University, Baidu Inc.
Cong Han
Google, Columbia University
Yafei Li
Nanjing Normal University
Zhipeng Jin
Baidu Inc.
Xiawei Li
Baidu Inc.
SiNan Du
Tsinghua University
Wen Tao
Baidu Inc.
Yi Yang
Baidu Inc.
Shuanglong Li
Baidu Inc.
Chun Yuan
Tsinghua University
Liu Lin
Beijing Jiaotong University