GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe distortion and poor legibility of complex scripts—particularly Chinese—in scene text editing, this paper proposes a high-fidelity text editing method. Our approach operates within a latent diffusion framework and integrates three key components: (1) a novel glyph attention module that explicitly models hierarchical structural relationships across stroke-, character-, and text-line levels; (2) a multi-scale OCR feature pyramid to enable fine-grained glyph-aware guidance; and (3) a custom glyph encoder jointly optimized with glyph attention and OCR feature fusion, thereby incorporating both stroke-structural priors and global semantic constraints. Experiments demonstrate substantial improvements over multilingual state-of-the-art methods: +18.02% sentence accuracy and −53.28% FID on text regions, confirming significant gains in character legibility and visual consistency.

📝 Abstract
Scene text editing, a subfield of image editing, requires modifying texts in images while preserving style consistency and visual coherence with the surrounding environment. While diffusion-based methods have shown promise in text generation, they still struggle to produce high-quality results. These methods often generate distorted or unrecognizable characters, particularly when dealing with complex characters like Chinese. In such systems, characters are composed of intricate stroke patterns and spatial relationships that must be precisely maintained. We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our key insight is that existing methods, despite using pretrained OCR models for feature extraction, fail to capture the hierarchical nature of text structures - from individual strokes to stroke-level interactions to overall character-level structure. To address this, our glyph encoder explicitly models and captures the cross-level interactions between local-level individual characters and global-level text lines through our novel glyph attention module. Meanwhile, our model implements a feature pyramid network to fuse the multi-scale OCR backbone features at the global level. Through these cross-level and multi-scale fusions, we obtain more detailed glyph-aware guidance, enabling precise control over the scene text generation process. Our method achieves an 18.02% improvement in sentence accuracy over the state-of-the-art multi-lingual scene text editing baseline, while simultaneously reducing the text-region Fréchet inception distance by 53.28%.
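The cross-level interaction the abstract describes, where character-level features attend over text-line-level features, can be illustrated with a minimal scaled dot-product attention sketch. The paper does not publish this exact formulation here, so every function name, shape, and dimension below is illustrative, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_level_attention(char_feats, line_feats):
    """Hypothetical sketch of cross-level glyph attention:
    local character tokens (queries) attend over global
    text-line tokens (keys/values) via scaled dot-product.
    char_feats: (N_chars, d), line_feats: (N_lines, d)."""
    d = char_feats.shape[-1]
    scores = char_feats @ line_feats.T / np.sqrt(d)   # (N_chars, N_lines)
    weights = softmax(scores, axis=-1)                # rows sum to 1
    return weights @ line_feats                       # (N_chars, d)

# Toy example: 6 character tokens, 2 line tokens, 16-dim features.
rng = np.random.default_rng(0)
chars = rng.standard_normal((6, 16))
lines = rng.standard_normal((2, 16))
out = cross_level_attention(chars, lines)
print(out.shape)  # (6, 16)
```

In the actual model, queries, keys, and values would pass through learned projections; this sketch omits them to show only the cross-level routing of information.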
Problem

Research questions and friction points this paper is trying to address.

Improving scene text editing quality with stroke-level precision
Addressing distortion in complex characters like Chinese
Enhancing style consistency and visual coherence in text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Glyph encoder guides diffusion model precisely
Glyph attention module captures cross-level interactions
Feature pyramid network fuses multi-scale OCR features
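The third bullet, fusing multi-scale OCR backbone features with a feature pyramid network, follows the standard top-down pattern: upsample the coarsest map and add it into progressively finer ones. A minimal numpy sketch, assuming nearest-neighbour upsampling and equal channel counts (real FPNs insert learned lateral 1x1 convolutions, omitted here):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling for an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_fuse(features):
    """Top-down feature pyramid fusion (sketch, not the paper's code).
    `features` is ordered fine -> coarse, all with the same channel
    count C; each finer level receives the upsampled coarser sum."""
    fused = features[-1]          # start from the coarsest map
    outs = [fused]
    for f in reversed(features[:-1]):
        fused = f + upsample2x(fused)
        outs.append(fused)
    return outs[::-1]             # ordered fine -> coarse again

# Toy multi-scale maps: 16x16, 8x8, 4x4, all with 8 channels.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((16, 16, 8)),
         rng.standard_normal((8, 8, 8)),
         rng.standard_normal((4, 4, 8))]
fused = fpn_fuse(feats)
print([f.shape for f in fused])  # [(16, 16, 8), (8, 8, 8), (4, 4, 8)]
```

The output pyramid keeps every resolution, so the finest level carries both local stroke detail and the globally pooled OCR context, which is the property the glyph-aware guidance relies on.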
Tong Wang
MT Lab, Meitu Inc., Beijing 100083, China
Ting Liu
MT Lab, Meitu Inc., Beijing 100083, China
Xiaochao Qu
MT Lab, Meitu Inc., Beijing 100083, China
Chengjing Wu
MT Lab, Meitu Inc., Beijing 100083, China
Luoqi Liu
Director of MT Lab; Meitu
Computer Vision
Xiaolin Hu
Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing 100084, China