Recognition-Synergistic Scene Text Editing

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Scene text editing requires modifying textual content while preserving stylistic attributes such as font, lighting, and perspective, yet existing explicit disentanglement methods suffer from complex pipelines and poor generalization. This paper proposes a recognition-editing co-modeling paradigm: a multimodal parallel Transformer decoder jointly predicts the target text and the edited image, and a cyclic self-supervised fine-tuning stage achieves implicit style-content disentanglement without requiring paired training data. The method achieves state-of-the-art performance on both synthetic and real-world benchmarks, and the high-fidelity, challenging edited samples it produces further improve the robustness of downstream text recognition models. Key innovations include (i) joint modeling of text semantics and image generation, (ii) an implicit disentanglement mechanism that avoids explicit attribute decomposition, and (iii) a self-supervised optimization framework driven solely by unpaired data.
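To make the co-modeling idea concrete, below is a minimal PyTorch sketch of such a parallel multimodal decoder: learned text and image queries cross-attend to the source-image features concatenated with the target-text embedding, and two heads emit character logits and discrete image-token logits in a single non-autoregressive pass. All module names, dimensions, and the VQ-style image-token output are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ParallelMultimodalDecoder(nn.Module):
    """Sketch: one Transformer decoder predicts text and image tokens in parallel.

    Hypothetical interface; vocabulary sizes, sequence lengths, and the use of
    discrete image tokens are assumptions for illustration only.
    """

    def __init__(self, vocab_size=100, img_vocab=1024, d_model=256,
                 n_heads=8, n_layers=4, text_len=25, img_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)          # target-text embedding
        self.text_query = nn.Parameter(torch.randn(text_len, d_model))
        self.img_query = nn.Parameter(torch.randn(img_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.text_head = nn.Linear(d_model, vocab_size)           # character logits
        self.img_head = nn.Linear(d_model, img_vocab)             # image-token logits

    def forward(self, src_feats, tgt_text_ids):
        # src_feats: (B, S, d_model) source-image features; tgt_text_ids: (B, text_len)
        B = src_feats.size(0)
        memory = torch.cat([src_feats, self.tok_emb(tgt_text_ids)], dim=1)
        queries = torch.cat([self.text_query, self.img_query], dim=0)
        h = self.decoder(queries.unsqueeze(0).expand(B, -1, -1), memory)
        t = self.text_query.size(0)
        return self.text_head(h[:, :t]), self.img_head(h[:, t:])  # both branches at once

dec = ParallelMultimodalDecoder()
text_logits, img_logits = dec(torch.randn(2, 48, 256), torch.randint(0, 100, (2, 25)))
# text_logits: (2, 25, 100); img_logits: (2, 64, 1024)
```

Because no causal mask is applied, both output streams are produced in one forward pass rather than token by token, which is what "parallel" refers to here.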

📝 Abstract
Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on the Transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, RS-STE achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code is available at https://github.com/ZhengyaoFang/RS-STE.
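The twice-cyclic fine-tuning can be read as two chained edits: render a random text onto an unpaired source image, then render the original text back and penalize any drift from the source. The sketch below reuses the ParallelMultimodalDecoder above; the tokenizer stub, the choice of cross-entropy losses, and their equal weighting are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

class ToyImageTokenizer(torch.nn.Module):
    """Hypothetical stand-in for a VQ-style image tokenizer/encoder."""

    def __init__(self, d_model=256, img_len=64, img_vocab=1024):
        super().__init__()
        self.emb = torch.nn.Embedding(img_vocab, d_model)
        self.img_len, self.img_vocab = img_len, img_vocab

    def tokenize(self, img):
        # Map pixels to discrete token ids (purely illustrative, not a real VQ).
        flat = img.flatten(1)[:, : self.img_len]
        return (flat.abs() * 100).long() % self.img_vocab

    def forward(self, token_ids):
        return self.emb(token_ids)  # (B, img_len, d_model)

def cyclic_step(decoder, tok, img, src_text, rand_text):
    """One twice-cyclic training step on an unpaired image (illustrative)."""
    src_tokens = tok.tokenize(img)
    # Pass 1: source image + random target text -> intermediate edited image.
    text1, img1 = decoder(tok(src_tokens), rand_text)
    # Pass 2: edit back with the original text; should reconstruct the source.
    # (argmax blocks gradients; in practice a straight-through or Gumbel trick
    # would be needed so the cycle loss trains pass 1's image branch)
    text2, img2 = decoder(tok(img1.argmax(-1)), src_text)
    # Content consistency: each pass's text head reads back its conditioning text.
    content = (F.cross_entropy(text1.flatten(0, 1), rand_text.flatten())
               + F.cross_entropy(text2.flatten(0, 1), src_text.flatten()))
    # Style/cycle consistency: the reconstruction should match the source tokens.
    cycle = F.cross_entropy(img2.flatten(0, 1), src_tokens.flatten())
    return content + cycle
```

No ground-truth edited image ever appears: the source image itself supervises the second pass, which is what makes training on unpaired real-world data possible.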
Problem

Research questions and friction points this paper is trying to address.

Modify text in scene images while preserving style consistency.
Simplify complex pipelines in traditional text editing methods.
Enhance style and content consistency using unpaired real-world data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework integrates text recognition and editing.
Transformer-based multi-modal parallel decoder predicts text content and stylized images in parallel.
Cyclic self-supervised fine-tuning enhances style and content consistency.
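
The abstract also reports that the generated hard cases can boost downstream recognizers. Below is a hedged sketch of one plausible data loop, assuming "hard" means edited samples the current recognizer misreads; the paper's actual selection criterion may differ.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def select_hard_cases(recognizer, imgs, labels):
    """Keep edited samples the current recognizer still misreads.

    `recognizer` is any model returning (B, L, vocab) logits; greedy decoding
    and the per-character mismatch rule are assumptions, not the paper's.
    """
    with torch.no_grad():
        pred = recognizer(imgs).argmax(-1)   # (B, L) predicted character ids
    wrong = (pred != labels).any(dim=1)      # at least one character is wrong
    return imgs[wrong], labels[wrong]

# Mix selected hard cases (e.g. from select_hard_cases on RS-STE outputs)
# into the recognizer's training data; shapes here are placeholders.
real = TensorDataset(torch.randn(100, 3, 32, 128), torch.randint(0, 97, (100, 25)))
hard = TensorDataset(torch.randn(40, 3, 32, 128), torch.randint(0, 97, (40, 25)))
loader = DataLoader(ConcatDataset([real, hard]), batch_size=32, shuffle=True)
```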
Authors

Zhengyao Fang, Harbin Institute of Technology, Shenzhen
Pengyuan Lyu, Huazhong University of Science and Technology (computer vision)
Jingjing Wu, Department of Computer Vision Technology, Baidu Inc.
Chengquan Zhang, unknown affiliation (computer vision, application of deep learning)
Jun Yu, Harbin Institute of Technology, Shenzhen
Guangming Lu, Harbin Institute of Technology, Shenzhen (computer vision, machine learning)
Wenjie Pei, Harbin Institute of Technology, Shenzhen