🤖 AI Summary
In real-world text image super-resolution, diffusion models suffer from low text fidelity and struggle to simultaneously preserve visual realism and structural accuracy. To address this, we propose a progressive diffusion framework that integrates pretrained super-resolution priors with confidence-weighted cross-attention. Our key contributions are: (1) a progressive multi-stage sampling strategy to mitigate degradation caused by low-quality low-resolution inputs; (2) an SR-prior-guided U-Net architecture that enhances high-frequency text structure modeling; and (3) a dynamic cross-attention module weighted by optical character recognition (OCR) confidence scores, explicitly improving character-level fidelity. Evaluated on real-world text image datasets, our method achieves significant gains in structural accuracy (+12.7%) and perceptual quality, while demonstrating superior generalization and robustness compared to state-of-the-art diffusion-based and GAN-based baselines.
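Contribution (1), the progressive multi-stage sampling strategy, amounts to changing which data pools the model trains on as training advances. The paper does not publish an implementation; the following is a minimal sketch under assumed pool names and stage boundaries (`synthetic_clean`, `synthetic_degraded`, `real_world`, and `steps_per_stage` are all illustrative, not from the paper):

```python
import random

# Hypothetical staged curriculum: early stages favor clean synthetic data,
# later stages mix in degraded and real-world low-resolution inputs.
STAGES = [
    {"synthetic_clean": 1.0},                          # stage 1: easy, clean data
    {"synthetic_clean": 0.5, "synthetic_degraded": 0.5},  # stage 2: add degradations
    {"synthetic_degraded": 0.3, "real_world": 0.7},    # stage 3: mostly real-world LR
]

def sample_pool(step, steps_per_stage=10_000, rng=random):
    """Pick which data pool to draw the next training batch from."""
    stage = min(step // steps_per_stage, len(STAGES) - 1)
    names, weights = zip(*STAGES[stage].items())
    return rng.choices(names, weights=weights, k=1)[0]
```

The design choice here is that the schedule is a pure function of the global step, so it is reproducible and trivially resumable from a checkpoint.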
📝 Abstract
Restoring low-resolution text images is challenging, as it requires preserving both the fidelity and the stylistic realism of the restored text. Existing text image restoration methods often fall short in difficult cases: traditional super-resolution models cannot guarantee clarity, while diffusion-based methods fail to maintain fidelity. In this paper, we introduce a novel framework that improves the generalization ability of diffusion models for text image super-resolution (SR), with a particular emphasis on fidelity. First, we propose a progressive data-sampling strategy that incorporates diverse image types at different stages of training, stabilizing convergence and improving generalization. For the network architecture, we leverage a pretrained SR prior to provide robust spatial reasoning, enhancing the model's ability to preserve textual information. Additionally, we employ a cross-attention mechanism to better integrate textual priors. To further reduce errors in these priors, we use OCR confidence scores to dynamically adjust the weight of textual features during training. Extensive experiments on real-world datasets demonstrate that our approach not only produces text images with more realistic visual appearance but also improves the accuracy of text structure.
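The confidence-weighted cross-attention described above can be sketched as standard scaled dot-product cross-attention whose per-token weights are scaled by OCR confidence before renormalization. This is a minimal NumPy illustration of that idea, not the paper's implementation; all shapes and names are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def confidence_weighted_cross_attention(img_feats, text_feats, conf):
    """img_feats: (N, d) queries from image features.
    text_feats: (T, d) keys/values from the OCR text prior.
    conf: (T,) OCR confidence in [0, 1] per text token."""
    d = img_feats.shape[-1]
    scores = img_feats @ text_feats.T / np.sqrt(d)    # (N, T) attention logits
    attn = softmax(scores, axis=-1)
    attn = attn * conf[None, :]                       # down-weight unreliable OCR tokens
    attn = attn / attn.sum(axis=-1, keepdims=True)    # renormalize rows to sum to 1
    return attn @ text_feats                          # (N, d) attended text features
```

A token the OCR engine is unsure about (low `conf`) contributes proportionally less to the fused features, so a misrecognized character cannot dominate the restoration.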