AI Summary
This work addresses the domain gap between synthetic and real-world text images, which the authors attribute to insufficient diversity of corpus, font, and layout in existing rendering-based synthetic data. To narrow this gap, they propose UnionST, a high-diversity synthesis engine, and use it to build the UnionST-S dataset, which better reflects the complexity of real-world scenarios. They further introduce a self-evolution learning (SEL) framework for effective annotation of real data. Models trained on UnionST-S improve significantly over those trained on existing synthetic datasets and even surpass real-data performance in certain scenarios; with SEL, models achieve competitive performance while using only 9% of real data labels.
Abstract
Large-scale, category-balanced text data is essential for training effective Scene Text Recognition (STR) models, but such data is hard to obtain when collecting real images. Synthetic data offers a cost-effective and perfectly labeled alternative. However, models trained on it often lag behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitation: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulation of challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over those trained on existing synthetic datasets, and even surpass real-data performance in certain scenarios. Moreover, when using SEL, the trained models achieve competitive performance while seeing only 9% of real data labels.