MDiff4STR: Mask Diffusion Model for Scene Text Recognition

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces Masked Diffusion Models (MDMs) to Scene Text Recognition (STR) for the first time and proposes a training-inference consistency optimization framework to address their low recognition accuracy, inefficient inference, and overconfident predictions. The key contributions are: (1) six novel noising strategies designed to align the training and inference distributions; (2) a token-replacement noise mechanism that explicitly mitigates model overconfidence; and (3) a customized noise schedule enabling highly efficient inference with only three denoising steps. Experiments demonstrate that the method consistently outperforms state-of-the-art autoregressive models across multiple standard and challenging STR benchmarks. Remarkably, it achieves significant accuracy gains while maintaining millisecond-level inference latency, striking a strong balance between accuracy and speed in diffusion-based STR.

📝 Abstract
Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, we introduce MDMs into the Scene Text Recognition (STR) task for the first time. We show that a vanilla MDM improves recognition efficiency but lags behind ARMs in accuracy. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: the noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a token-replacement noise mechanism that introduces a non-mask noise type, encouraging the model to reconsider and revise overly confident but incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, as well as settings with and without pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy while maintaining fast inference with only three denoising steps. Code: https://github.com/Topdu/OpenOCR.
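The three-step inference described in the abstract follows the usual MDM recipe of confidence-ranked iterative unmasking. The sketch below is an illustrative toy, not the paper's released code: `toy_denoiser`, `mdm_decode`, and the linear reveal schedule are assumptions, and the random logits stand in for a vision-conditioned transformer.

```python
import numpy as np

MASK = -1  # id marking a still-masked position

def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the learned denoiser: random logits per position.
    A real MDM for STR would be a transformer conditioned on image features."""
    return rng.standard_normal((len(tokens), vocab_size))

def mdm_decode(seq_len, vocab_size, num_steps=3, seed=0):
    """Confidence-ranked iterative unmasking: each step predicts every masked
    position and commits only the most confident ones, so the whole sequence
    is revealed within `num_steps` denoising steps."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    for step in range(num_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = toy_denoiser(tokens, vocab_size, rng)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        preds = probs[masked].argmax(-1)   # best guess per masked position
        conf = probs[masked].max(-1)       # its confidence
        # commit enough positions that everything is revealed by the last step
        k = int(np.ceil(masked.size / (num_steps - step)))
        keep = np.argsort(-conf)[:k]
        tokens[masked[keep]] = preds[keep]
    return tokens
```

With `seq_len=10` and `num_steps=3`, the schedule commits 4, then 3, then 3 positions, which is how a fixed small step budget keeps latency at the millisecond level regardless of text length.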
Problem

Research questions and friction points this paper is trying to address.

Improves scene text recognition accuracy with diffusion models
Addresses training-inference noising gap in text recognition
Reduces overconfident predictions during inference in STR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Mask Diffusion Models to Scene Text Recognition
Develops six noising strategies to align training with inference
Proposes token-replacement noise to revise overconfident predictions
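The token-replacement idea above can be sketched as a training-time corruption routine. This is a minimal illustration under assumed names (`noise_sequence`, `MASK_TOK`) and assumed ratios, not the paper's actual noising schedule: most corrupted positions receive the mask token, while a small fraction receive a random wrong token, so the model also learns to detect and revise visible-but-incorrect inputs instead of trusting them.

```python
import random

MASK_TOK = "[M]"  # special mask symbol

def noise_sequence(tokens, vocab, mask_ratio=0.5, replace_ratio=0.1, rng=None):
    """Corrupt a ground-truth sequence for training. Each position is either
    masked (standard MDM noise), replaced with a random wrong token
    (token-replacement noise), or kept clean; the target is always the
    original token, so replaced positions must be actively revised."""
    rng = rng or random.Random(0)
    noised, targets = [], []
    for t in tokens:
        r = rng.random()
        if r < mask_ratio:
            noised.append(MASK_TOK)  # masking noise
        elif r < mask_ratio + replace_ratio:
            wrong = rng.choice([v for v in vocab if v != t])
            noised.append(wrong)     # replacement noise: visible but wrong
        else:
            noised.append(t)         # kept clean
        targets.append(t)            # reconstruct the original everywhere
    return noised, targets
```

Because replaced tokens look like ordinary predictions at inference time, training on them gives the model a reason to reconsider confident-looking context rather than locking it in, which is the overconfidence failure the bullet describes.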
👥 Authors

Yongkun Du
Fudan University
Computer Vision, OCR

Miaomiao Zhao
School of Computer Science and Technology, Beijing Jiaotong University, China

Songlin Fan
Institute of Trustworthy Embodied AI, Fudan University, China

Zhineng Chen
Institute of Trustworthy Embodied AI, Fudan University
Computer Vision, OCR, Multimedia Analysis

Caiyan Jia
School of Computer Science and Technology, Beijing Jiaotong University, China

Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis, Embodied AI, Trustworthy AI