AI Summary
This work addresses scene text editing, which requires precisely modifying textual content while preserving visual realism and leaving non-target regions unchanged. Progress on this task has been hindered by the scarcity of high-quality training data and the absence of standardized evaluation benchmarks. To overcome these limitations, the authors propose TextSculptor, a framework built around an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. The pipeline yields TextSculpt-Data, a large-scale dataset of 3.2 million samples, and the authors also introduce TextSculpt-Bench, a benchmark dedicated to scene text editing. Experiments show that TextSculptor improves the performance of open-source models and narrows the gap with proprietary systems, and the data and benchmark are publicly released.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. However, scene text editing remains challenging, as it requires models to precisely modify textual content while preserving visual realism and non-target regions. Current open-source models still lag behind proprietary systems, largely due to the scarcity of high-quality training data and the lack of standardized benchmarks tailored to text editing. To address these challenges, we present TextSculptor, a comprehensive framework for data construction and evaluation of scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Based on this pipeline, we build TextSculpt-Data, a large-scale dataset containing 3.2M training samples, including 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, a benchmark covering four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at https://github.com/linyiheng123/TextSculptor.
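As a rough illustration of two of the evaluation axes named in the abstract, the sketch below scores text accuracy as a normalized edit-distance similarity between an OCR transcript and the target string, and scores background preservation as the mean absolute pixel difference outside the edited region. This is not the TextSculpt-Bench protocol: the function names, the choice of Levenshtein distance, and the use of mean absolute difference are all assumptions made for illustration, and images are represented as plain nested lists of grayscale values to keep the example self-contained.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def text_accuracy(ocr_text: str, target_text: str) -> float:
    """Similarity in [0, 1]; 1.0 means the OCR output matches the target exactly.

    Hypothetical metric: the paper's OCR-based text alignment may differ.
    """
    longest = max(len(ocr_text), len(target_text))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(ocr_text, target_text) / longest


def background_error(source, edited, edit_mask) -> float:
    """Mean absolute pixel difference over pixels where edit_mask is False.

    Hypothetical metric: lower values mean the non-target region changed less.
    """
    total, count = 0.0, 0
    for src_row, out_row, mask_row in zip(source, edited, edit_mask):
        for s, o, m in zip(src_row, out_row, mask_row):
            if not m:  # only pixels outside the edited text region count
                total += abs(s - o)
                count += 1
    return total / count if count else 0.0
```

In this sketch, an edit that renders the requested text correctly and leaves unmasked pixels untouched would score `text_accuracy` of 1.0 and `background_error` of 0.0. Visual quality, which the abstract assigns to multimodal judgment, is not covered here.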