Global-Local Aware Scene Text Editing

📅 2025-06-30
🏛️ IEEE International Conference on Multimedia and Expo
📈 Citations: 0
Influential: 0
🤖 AI Summary
Scene Text Editing (STE) faces two key challenges: inconsistent texture alignment between edited regions and the background, and geometric distortion arising from text-length variations. This paper proposes a global-local aware editing framework. Methodologically, it (1) decouples global context modeling from local detail synthesis to enable coherent style-texture fusion; (2) introduces scale-invariant text style vectors for resolution-agnostic style transfer; and (3) incorporates an affine fusion module that explicitly preserves the aspect ratio of the target text. Joint adversarial losses, feature enhancement, and structured training further improve editing consistency and robustness. Evaluated on real-world scene text datasets, the method achieves state-of-the-art performance in PSNR, SSIM, and user studies, excelling particularly in long-text replacement and text-scaling tasks.

📝 Abstract
Scene Text Editing (STE) involves replacing text in a scene image with new target text while preserving both the original text style and background texture. Existing methods suffer from two major challenges: inconsistency and length-insensitivity. They often fail to maintain coherence between the edited local patch and the surrounding area, and they struggle to handle significant differences in text length before and after editing. To tackle these challenges, we propose an end-to-end framework called Global-Local Aware Scene Text Editing (GLASTE), which simultaneously incorporates high-level global contextual information along with delicate local features. Specifically, we design a global-local combination structure, joint global and local losses, and enhance text image features to ensure consistency in text style within local patches while maintaining harmony between local and global areas. Additionally, we express the text style as a vector independent of the image size, which can be transferred to target text images of various sizes. We use an affine fusion to fill target text images into the editing patch while maintaining their aspect ratio unchanged. Extensive experiments on real-world datasets validate that our GLASTE model outperforms previous methods in both quantitative metrics and qualitative results and effectively mitigates the two challenges.
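The "style vector independent of the image size" idea can be illustrated with a minimal sketch. Here the style encoder is a hypothetical stand-in (a fixed random per-pixel projection followed by global average pooling); the paper's actual learned encoder is more involved, but the key property is the same: the output length never depends on the input resolution, so the style can be transferred to target text images of any size.

```python
import numpy as np

def style_vector(image: np.ndarray, channels: int = 64) -> np.ndarray:
    """Toy style encoder: per-pixel random projection + global average pooling.
    Hypothetical stand-in for GLASTE's learned style encoder; illustrates only
    that the output dimension is independent of the input image size."""
    h, w, c = image.shape
    rng = np.random.default_rng(0)               # fixed weights for reproducibility
    proj = rng.standard_normal((c, channels))
    features = image.reshape(h * w, c) @ proj    # per-pixel features, shape (h*w, channels)
    return features.mean(axis=0)                 # global average pool -> (channels,)

small = style_vector(np.ones((32, 128, 3)))      # small crop
large = style_vector(np.ones((64, 256, 3)))      # larger crop, same style vector length
assert small.shape == large.shape == (64,)       # resolution-agnostic
```

Because pooling averages over all spatial positions, crops of different sizes map to vectors of the same dimension, which is what makes size-independent style transfer possible.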
Problem

Research questions and friction points this paper is trying to address.

Maintaining text style and background consistency during editing
Handling varying text lengths before and after replacement
Ensuring harmony between edited local patches and the global context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global-local combination structure with joint losses
Text style vector transfer independent of image size
Affine fusion maintains aspect ratio during text insertion
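The aspect-ratio-preserving placement behind the affine fusion idea can be sketched as a fit-inside affine transform. This is only an illustration of the geometric constraint (uniform scale plus centering); the paper's affine fusion module is learned and more elaborate.

```python
def fit_affine(text_w: int, text_h: int, patch_w: int, patch_h: int):
    """Uniform scale + centering that fits a rendered text image inside the
    editing patch without distorting its aspect ratio. Returns the 2x3 affine
    matrix [[s, 0, tx], [0, s, ty]] (a sketch, not the paper's learned module)."""
    s = min(patch_w / text_w, patch_h / text_h)   # one uniform scale -> no distortion
    tx = (patch_w - s * text_w) / 2               # center horizontally in the patch
    ty = (patch_h - s * text_h) / 2               # center vertically in the patch
    return [[s, 0.0, tx], [0.0, s, ty]]

# A wide 200x40 text image placed into a 100x60 patch:
# scale 0.5 -> text becomes 100x20, centered with a 20px vertical margin.
M = fit_affine(200, 40, 100, 60)
assert M == [[0.5, 0.0, 0.0], [0.0, 0.5, 20.0]]
```

Taking the minimum of the two axis scales is what guarantees the target text never stretches to fill the patch, which is the distortion the affine fusion is designed to avoid.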
Fu-Yao Yang
Harbin Institute of Technology, Harbin, China
Tonghua Su
Professor of Harbin Institute of Technology
pattern recognition, character recognition, machine learning, software engineering
Donglin Di
Li Auto Inc.
Generative Models, Embodied AI, Medical Image, Multimedia
Yin Chen
Lecturer in Mathematics at University of Saskatchewan
Invariant theory, Lie theory, Commutative algebra, Applied algebraic geometry
Xiangqian Wu
Harbin Institute of Technology, Harbin, China; Suzhou Research Institute, HIT, Suzhou, China
Zhongjie Wang
Harbin Institute of Technology, Harbin, China
Lei Fan
University of New South Wales, Sydney, Australia