TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance

📅 2025-05-29
📈 Citations: 0 (influential: 0)
🤖 AI Summary
In multilingual scene text image super-resolution (STISR), diffusion models suffer from inaccurate text localization and weak modeling of character-shape priors, leading to text hallucination, structural distortion, and degraded readability. To address these issues, this paper proposes an OCR-guided, cross-modal framework for character-shape prior modeling. The architecture jointly integrates a multilingual OCR detector, a UTF-8 text encoder, and vision-language cross-attention, together with two robustness mechanisms that tolerate imperfect OCR, enabling precise text-region localization and structure-faithful reconstruction. Evaluated on the TextZoom and TextVQA benchmarks, our method significantly outperforms state-of-the-art approaches: text recognition accuracy improves by 12.6%, and PSNR, SSIM, and LPIPS all reach new best values. This work establishes a new benchmark for STISR, particularly in challenging multilingual settings.

📝 Abstract
While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.
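The UTF-8 conditioning path described in the abstract can be sketched roughly as follows. This is a minimal illustration under assumed names and dimensions, not the paper's implementation: OCR output is encoded as raw UTF-8 bytes (a 256-symbol vocabulary that covers all scripts), each byte is embedded, and image features attend to the byte sequence via a single-head cross-attention without learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32                                               # shared feature width (assumed)
byte_embed = rng.normal(scale=0.02, size=(256, D))   # one embedding row per UTF-8 byte

def encode_text(s: str) -> np.ndarray:
    """Map a (possibly multilingual) string to a (T, D) byte-embedding sequence."""
    return byte_embed[np.frombuffer(s.encode("utf-8"), dtype=np.uint8)]

def cross_attention(img_feats: np.ndarray, txt_feats: np.ndarray) -> np.ndarray:
    """Image tokens query the text byte sequence (single head, no projections)."""
    scores = img_feats @ txt_feats.T / np.sqrt(img_feats.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ txt_feats            # (N, D) text-conditioned image features

txt = encode_text("Café 北京")            # mixed-script OCR output, 12 UTF-8 bytes
img = rng.normal(size=(16, D))            # 16 image tokens from a detected text region
out = cross_attention(img, txt)
print(out.shape)                          # (16, 32)
```

Working at the byte level sidesteps per-language tokenizers entirely, which is presumably why a UTF-8 encoder suits the multilingual setting.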
Problem

Research questions and friction points this paper is trying to address.

- Improving text region localization in super-resolution
- Enhancing multilingual character-to-shape modeling
- Reducing hallucinated textures in super-resolved text
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Multimodal diffusion model for multilingual text SR
- OCR-guided text region localization and enhancement
- UTF-8 text encoder with cross-attention priors
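The OCR-guided data flow implied by these contributions can be sketched as a simple pipeline: a detector proposes text-region boxes on the low-resolution image, each region is cropped and read by a recognizer, and the (crop, text) pairs become conditioning inputs for the diffusion SR model. The detector and OCR below are placeholder stubs, not the paper's actual models.

```python
import numpy as np

def detect_text_regions(lr_image: np.ndarray):
    """Stub detector: return (y0, x0, y1, x1) boxes; fixed boxes for illustration."""
    h, w = lr_image.shape[:2]
    return [(0, 0, h // 2, w), (h // 2, 0, h, w)]

def run_ocr(crop: np.ndarray) -> str:
    """Stub OCR: a real multilingual recognizer would go here."""
    return "placeholder"

def build_conditioning(lr_image: np.ndarray):
    """Pair each detected text crop with its OCR string for the SR model."""
    pairs = []
    for (y0, x0, y1, x1) in detect_text_regions(lr_image):
        crop = lr_image[y0:y1, x0:x1]
        pairs.append((crop, run_ocr(crop)))
    return pairs

lr = np.zeros((32, 64, 3))                # toy low-resolution image
cond = build_conditioning(lr)
print(len(cond), cond[0][0].shape)        # 2 (16, 64, 3)
```

In the actual model, the OCR strings would pass through the UTF-8 encoder and cross-attention rather than being carried as raw strings, and the robustness mechanisms would guard against recognizer errors.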
Authors

Keren Ye (Google): Vision and Language, Object Detection, Scene Graph Generation, Diffusion Models
Ignacio Garcia Dorado (Google)
Michalis Raptis (Google)
M. Delbracio (Google)
Irene Zhu (Google)
P. Milanfar (Google)
Hossein Talebi (Google): Machine Learning, Computer Vision, Computational Photography