MDiff4STR: Mask Diffusion Model for Scene Text Recognition

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces Masked Diffusion Models (MDMs) to Scene Text Recognition (STR) for the first time and proposes a training-inference consistency optimization framework to address their low recognition accuracy, inefficient inference, and overconfident predictions. The key contributions are: (1) six novel noising strategies designed to align the training and inference distributions; (2) a token-replacement noise mechanism that explicitly mitigates model overconfidence; and (3) a customized noise schedule enabling highly efficient inference with only three denoising steps. Experiments demonstrate that the method consistently outperforms state-of-the-art autoregressive models across multiple standard and challenging STR benchmarks. Remarkably, it achieves significant accuracy gains while maintaining millisecond-level inference latency, striking a strong balance between accuracy and speed in diffusion-based STR.

📝 Abstract
Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, we introduce MDMs into the Scene Text Recognition (STR) task for the first time. We show that a vanilla MDM improves recognition efficiency but lags behind ARMs in accuracy. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: the noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a token-replacement noise mechanism that introduces a non-mask noise type, encouraging the model to reconsider and revise overly confident but incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, as well as settings with and without pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy while maintaining fast inference with only three denoising steps. Code: https://github.com/Topdu/OpenOCR.
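The three-step inference described in the abstract follows the usual MDM recipe of confidence-ranked iterative unmasking. The sketch below is an illustrative toy, not the paper's released code: `toy_denoiser`, `mdm_decode`, and the linear reveal schedule are assumptions, and the random logits stand in for a vision-conditioned transformer.

```python
import numpy as np

MASK = -1  # id marking a still-masked position

def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the learned denoiser: random logits per position.
    A real MDM for STR would be a transformer conditioned on image features."""
    return rng.standard_normal((len(tokens), vocab_size))

def mdm_decode(seq_len, vocab_size, num_steps=3, seed=0):
    """Confidence-ranked iterative unmasking: each step predicts every masked
    position and commits only the most confident ones, so the whole sequence
    is revealed within `num_steps` denoising steps."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    for step in range(num_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = toy_denoiser(tokens, vocab_size, rng)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        preds = probs[masked].argmax(-1)   # best guess per masked position
        conf = probs[masked].max(-1)       # its confidence
        # commit enough positions that everything is revealed by the last step
        k = int(np.ceil(masked.size / (num_steps - step)))
        keep = np.argsort(-conf)[:k]
        tokens[masked[keep]] = preds[keep]
    return tokens
```

With `seq_len=10` and `num_steps=3`, the schedule commits 4, then 3, then 3 positions, which is how a fixed small step budget keeps latency at the millisecond level regardless of text length.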
Problem

Research questions and friction points this paper is trying to address.

Improves scene text recognition accuracy with diffusion models
Addresses training-inference noising gap in text recognition
Reduces overconfident predictions during inference in STR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Mask Diffusion Models to Scene Text Recognition
Develops six noising strategies to align training with inference
Proposes token-replacement noise to revise overconfident predictions
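The token-replacement idea above can be sketched as a training-time corruption routine. This is a minimal illustration under assumed names (`noise_sequence`, `MASK_TOK`) and assumed ratios, not the paper's actual noising schedule: most corrupted positions receive the mask token, while a small fraction receive a random wrong token, so the model also learns to detect and revise visible-but-incorrect inputs instead of trusting them.

```python
import random

MASK_TOK = "[M]"  # special mask symbol

def noise_sequence(tokens, vocab, mask_ratio=0.5, replace_ratio=0.1, rng=None):
    """Corrupt a ground-truth sequence for training. Each position is either
    masked (standard MDM noise), replaced with a random wrong token
    (token-replacement noise), or kept clean; the target is always the
    original token, so replaced positions must be actively revised."""
    rng = rng or random.Random(0)
    noised, targets = [], []
    for t in tokens:
        r = rng.random()
        if r < mask_ratio:
            noised.append(MASK_TOK)  # masking noise
        elif r < mask_ratio + replace_ratio:
            wrong = rng.choice([v for v in vocab if v != t])
            noised.append(wrong)     # replacement noise: visible but wrong
        else:
            noised.append(t)         # kept clean
        targets.append(t)            # reconstruct the original everywhere
    return noised, targets
```

Because replaced tokens look like ordinary predictions at inference time, training on them gives the model a reason to reconsider confident-looking context rather than locking it in, which is the overconfidence failure the bullet describes.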
👥 Authors

Yongkun Du
Fudan University
Computer Vision, OCR

Miaomiao Zhao
School of Computer Science and Technology, Beijing Jiaotong University, China

Songlin Fan
Institute of Trustworthy Embodied AI, Fudan University, China

Zhineng Chen
Institute of Trustworthy Embodied AI, Fudan University
Computer Vision, OCR, Multimedia Analysis

Caiyan Jia
School of Computer Science and Technology, Beijing Jiaotong University, China

Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis, Embodied AI, Trustworthy AI