TextSculptor: Training and Benchmarking Scene Text Editing

📅 2026-05-20

🤖 AI Summary

This work addresses scene text editing, which requires precisely modifying textual content while preserving visual realism and leaving non-target regions unchanged. Progress on the task is held back by the scarcity of high-quality training data and the lack of standardized evaluation benchmarks. To address both, the authors propose TextSculptor, a framework built around an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. The pipeline yields TextSculpt-Data, a large-scale dataset of 3.2 million samples, and the authors also introduce TextSculpt-Bench, a benchmark covering text addition, replacement, removal, and hybrid editing. Experiments show that TextSculptor improves the text editing performance of open-source models and narrows the gap to proprietary systems. The data and benchmark are publicly released.

📝 Abstract

Recent advances in Multimodal Large Language Models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. However, scene text editing remains challenging, as it requires models to precisely modify textual content while preserving visual realism and non-target regions. Current open-source models still lag behind proprietary systems, largely due to the scarcity of high-quality training data and the lack of standardized benchmarks tailored to text editing. To address these challenges, we present TextSculptor, a comprehensive framework for data construction and evaluation of scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Based on this pipeline, we build TextSculpt-Data, a large-scale dataset containing 3.2M training samples, including 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, a benchmark covering four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at https://github.com/linyiheng123/TextSculptor.
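
The evaluation protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of normalized edit distance for OCR-based text alignment, and the masked mean absolute difference for background-region similarity are all assumptions made for illustration, and the multimodal-judgment component is omitted. The sketch assumes the OCR output is already available as a string and that images are grayscale grids with intensities in [0, 1].

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def text_accuracy(ocr_text: str, target_text: str) -> float:
    """Hypothetical text score: 1 minus the normalized edit distance."""
    longest = max(len(ocr_text), len(target_text))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(ocr_text, target_text) / longest


def background_similarity(
    source: Sequence[Sequence[float]],
    edited: Sequence[Sequence[float]],
    edit_mask: Sequence[Sequence[bool]],
) -> float:
    """Hypothetical background score: 1 minus the mean absolute difference
    over pixels OUTSIDE the edited region (mask value False)."""
    total = 0.0
    count = 0
    for src_row, out_row, mask_row in zip(source, edited, edit_mask):
        for s, o, inside_edit in zip(src_row, out_row, mask_row):
            if not inside_edit:
                total += abs(s - o)
                count += 1
    return 1.0 if count == 0 else 1.0 - total / count


if __name__ == "__main__":
    # One OCR misread ("0" for "O") out of four characters.
    print(text_accuracy("0PEN", "OPEN"))
    # The only changed pixel lies inside the edit mask, so the
    # background counts as fully preserved.
    src = [[0.2, 0.2], [0.2, 0.2]]
    out = [[0.2, 0.9], [0.2, 0.2]]
    mask = [[False, True], [False, False]]
    print(background_similarity(src, out, mask))
```

Restricting the similarity to pixels outside the edit mask separates the two requirements the abstract names: changes inside the text region are scored by the text metric, while any change outside it lowers the background score.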

Problem

Research questions and friction points this paper is trying to address.

scene text editing

training data scarcity

benchmarking

visual realism

background preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

scene text editing

data synthesis

multimodal benchmark