Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation

📅 2025-03-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges in scene text recognition (STR): label noise in real-world training data, and the under-explored question of how much vision encoder scaling and text decoder scaling each contribute to accuracy. We show empirically that, contrary to previous observations, scaling the decoder yields significant gains that exceed those obtained by scaling the encoder alone. To mitigate label noise, we propose Cloze Self-Distillation (CSD), in which a student model is distilled from context-aware soft predictions and pseudo-labels generated by a teacher model. We also introduce differential cross-attention into the decoder. Trained on real data only, our method achieves state-of-the-art (SOTA) results on 10 of 11 benchmarks while significantly reducing parameter count and computational cost.

📝 Abstract

Scaling architectures have been proven effective for improving Scene Text Recognition (STR), but the individual contributions of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudo-labels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.

Problem

Research questions and friction points this paper is trying to address.

Explores impact of scaling vision encoder and text decoder in Scene Text Recognition.

Addresses label noise in real-world data for improved STR model effectiveness.

Introduces Cloze Self-Distillation to mitigate label noise using context-aware predictions.
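
To make the distillation idea concrete, here is a minimal sketch of a per-character soft-label loss. It is an illustration under stated assumptions, not the paper's implementation: the function names, the choice of KL divergence as the distance, and the input shapes are all hypothetical. The teacher distribution for each character is assumed to have been predicted with the other characters of the pseudo-label available as context (the "cloze" setting), and the student is pulled toward those soft predictions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same alphabet."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def cloze_distillation_loss(teacher_probs, student_logits):
    """Mean per-position KL between teacher and student predictions.

    teacher_probs[i]  : assumed teacher distribution for character i,
                        predicted from the remaining characters as context.
    student_logits[i] : raw student scores for character i.
    Both names and the KL objective are assumptions made for this sketch.
    """
    losses = [
        kl_divergence(t, softmax(s))
        for t, s in zip(teacher_probs, student_logits)
    ]
    return sum(losses) / len(losses)
```

Because the target is a full distribution rather than a single hard label, a teacher that is unsure about a character (for example a mislabeled or ambiguous one) passes that uncertainty on to the student instead of forcing a possibly wrong label.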

Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling the decoder improves STR performance more than scaling the encoder alone.

Cloze Self-Distillation reduces the impact of label noise.

Differential cross-attention enhances the decoder architecture.
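
The page does not spell out how differential cross-attention is computed. The sketch below assumes the general differential-attention form, in which two softmax attention maps are computed and one is subtracted from the other before weighting the values: (softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V. Whether the paper uses exactly this form is an assumption. Here the queries would come from the text decoder and the keys and values from the vision encoder. The function names are hypothetical, `lam` is a fixed scalar for simplicity, and multi-head splitting and output normalization are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Row-wise softmax of scaled dot products, one row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def differential_cross_attention(q1, q2, k1, k2, values, lam):
    """Weight `values` by the difference of two attention maps.

    q1, q2 : two projections of the decoder queries (assumed).
    k1, k2 : two projections of the encoder keys (assumed).
    lam    : scalar weight on the subtracted map (fixed here for simplicity).
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    out = []
    for row1, row2 in zip(a1, a2):
        weights = [x - lam * y for x, y in zip(row1, row2)]
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out
```

With `lam = 0` this reduces to ordinary scaled dot-product cross-attention. The intuition behind subtracting a second map is that attention mass shared by both maps cancels out, leaving the weights that distinguish the relevant image regions.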

🔎 Similar Papers

FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting