OTR: Synthesizing Overlay Text Dataset for Text Removal

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text removal datasets such as SCUT-EnsText suffer from three critical limitations: ground-truth artifacts introduced by manual editing, overly simplistic backgrounds, and narrow evaluation metrics. These shortcomings hinder cross-domain generalization and objective model assessment. To address them, we propose a synthesis paradigm tailored to complex scenes, integrating object-aware layout planning with vision-language model (VLM) driven content generation to jointly produce photorealistic text overlays and artifact-free clean ground truth. Based on this framework, we construct and publicly release OTR, a large-scale, highly diverse benchmark featuring text superimposed on complex, realistic backgrounds across multi-scale, multi-occlusion, and multi-semantic-context scenarios. Experiments demonstrate that models trained on OTR achieve PSNR/SSIM improvements of over 2.1 dB / 0.03 on real-world images, significantly enhancing generalization and reconstruction fidelity and establishing a robust data foundation for privacy-preserving text removal and intelligent image editing.

📝 Abstract
Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder out-of-domain generalization or accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains other than scene texts. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at https://huggingface.co/datasets/cyberagent/OTR.
Problem

Research questions and friction points this paper is trying to address.

Addressing dataset limitations in text removal generalization
Synthesizing benchmark for non-scene text removal applications
Providing clean ground truth with complex background scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes overlay text dataset for text removal
Uses object-aware placement for complex backgrounds
Leverages vision-language models to generate content
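The object-aware placement idea above can be sketched in a few lines: a minimal, hypothetical Python example (not the authors' implementation) that grid-searches for a text anchor avoiding detected object boxes, so the clean ground truth is simply the untouched background. For simplicity, images are plain nested lists of RGB tuples and the "text" is rendered as a filled box; a real pipeline would use detector outputs, font metrics, and VLM-generated strings.

```python
def find_anchor(size, object_boxes, text_size, margin=4, step=8):
    """Grid-search a top-left anchor for a text box that avoids every
    object box (x0, y0, x1, y1). A toy stand-in for object-aware
    layout planning."""
    W, H = size
    tw, th = text_size

    def overlaps(a, b):
        return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

    for y in range(margin, H - th - margin, step):
        for x in range(margin, W - tw - margin, step):
            cand = (x, y, x + tw, y + th)
            if not any(overlaps(cand, box) for box in object_boxes):
                return x, y
    return None  # no artifact-free placement exists


def synthesize_pair(background, object_boxes, text_size):
    """Return (overlaid, clean). The ground truth is the untouched
    background, so no manual-editing artifacts can occur."""
    H, W = len(background), len(background[0])
    clean = [row[:] for row in background]
    overlaid = [row[:] for row in background]
    anchor = find_anchor((W, H), object_boxes, text_size)
    if anchor is not None:
        x0, y0 = anchor
        tw, th = text_size
        for y in range(y0, y0 + th):        # paint the overlay region white,
            for x in range(x0, x0 + tw):    # standing in for rendered glyphs
                overlaid[y][x] = (255, 255, 255)
    return overlaid, clean


# A blue 256x128 background with one "object" occupying the left half:
bg = [[(40, 90, 160) for _ in range(256)] for _ in range(128)]
overlaid, clean = synthesize_pair(bg, [(0, 0, 128, 128)], (96, 12))
```

Because the clean image is copied before any rendering, the overlay/ground-truth pair is exact by construction, which is the key property the paper contrasts with manually edited benchmarks.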
Jan Zdenek
CyberAgent, Tokyo, Japan
Wataru Shimoda
CyberAgent, Tokyo, Japan
Kota Yamaguchi
CyberAgent
Computer Vision · Machine Learning