What Is Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution

📅 2026-02-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the domain gap between synthetic and real-world text images, which the authors attribute to insufficient diversity in the corpus, fonts, and layouts of existing rendering-based synthetic datasets. To narrow this gap, they propose UnionST, a synthetic data engine that covers a union of challenging sample types, and use it to build UnionST-S, a large-scale dataset that better reflects the complexity of real-world scenes. They further introduce a self-evolution learning (SEL) framework for annotating real data efficiently. Models trained on UnionST-S improve significantly over those trained on existing synthetic datasets and surpass real-data performance in certain scenarios; with SEL, models reach competitive performance while using only 9% of the real data labels.

📝 Abstract

Large-scale, category-balanced text data is essential for training effective Scene Text Recognition (STR) models, but it is hard to obtain by collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, models trained on it often lag behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitation: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address this, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulations of challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over those trained on existing synthetic datasets, and even surpass real-data performance in certain scenarios. Moreover, when using SEL, the trained models achieve competitive performance while seeing only 9% of the real data labels.

Problem

Research questions and friction points this paper is trying to address.

synthetic data

scene text recognition

domain gap

data diversity

realism

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data

scene text recognition

self-evolution learning