🤖 AI Summary
In real-world text image super-resolution, diffusion models suffer from low text fidelity and struggle to simultaneously preserve visual realism and structural accuracy. To address this, we propose a progressive diffusion framework that integrates pretrained super-resolution priors with confidence-weighted cross-attention. Our key contributions are: (1) a progressive multi-stage sampling strategy to mitigate degradation caused by low-quality low-resolution inputs; (2) an SR-prior-guided U-Net architecture that enhances high-frequency text structure modeling; and (3) a dynamic cross-attention module weighted by optical character recognition (OCR) confidence scores, explicitly improving character-level fidelity. Evaluated on real-world text image datasets, our method achieves significant gains in structural accuracy (+12.7%) and perceptual quality, while demonstrating superior generalization and robustness compared to state-of-the-art diffusion-based and GAN-based baselines.
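Contribution (1), the progressive multi-stage sampling strategy, amounts to changing which data pools the model trains on as training advances. The paper does not publish an implementation; the following is a minimal sketch under assumed pool names and stage boundaries (`synthetic_clean`, `synthetic_degraded`, `real_world`, and `steps_per_stage` are all illustrative, not from the paper):

```python
import random

# Hypothetical staged curriculum: early stages favor clean synthetic data,
# later stages mix in degraded and real-world low-resolution inputs.
STAGES = [
    {"synthetic_clean": 1.0},                          # stage 1: easy, clean data
    {"synthetic_clean": 0.5, "synthetic_degraded": 0.5},  # stage 2: add degradations
    {"synthetic_degraded": 0.3, "real_world": 0.7},    # stage 3: mostly real-world LR
]

def sample_pool(step, steps_per_stage=10_000, rng=random):
    """Pick which data pool to draw the next training batch from."""
    stage = min(step // steps_per_stage, len(STAGES) - 1)
    names, weights = zip(*STAGES[stage].items())
    return rng.choices(names, weights=weights, k=1)[0]
```

The design choice here is that the schedule is a pure function of the global step, so it is reproducible and trivially resumable from a checkpoint.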
📝 Abstract
Restoring low-resolution text images is challenging, as it requires preserving both the fidelity and the stylistic realism of the restored text. Existing text image restoration methods often fall short in difficult cases: traditional super-resolution models cannot guarantee clarity, while diffusion-based methods fail to maintain fidelity. In this paper, we introduce a novel framework that improves the generalization ability of diffusion models for text image super-resolution (SR), with a particular emphasis on fidelity. First, we propose a progressive data-sampling strategy that incorporates diverse image types at different stages of training, stabilizing convergence and improving generalization. For the network architecture, we leverage a pretrained SR prior to provide robust spatial reasoning, enhancing the model's ability to preserve textual information. Additionally, we employ a cross-attention mechanism to better integrate textual priors. To further reduce errors in these priors, we use OCR confidence scores to dynamically adjust the weight of textual features during training. Extensive experiments on real-world datasets demonstrate that our approach not only produces text images with more realistic visual appearance but also improves the accuracy of text structure.
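The confidence-weighted cross-attention described above can be sketched as standard scaled dot-product cross-attention whose per-token weights are scaled by OCR confidence before renormalization. This is a minimal NumPy illustration of that idea, not the paper's implementation; all shapes and names are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def confidence_weighted_cross_attention(img_feats, text_feats, conf):
    """img_feats: (N, d) queries from image features.
    text_feats: (T, d) keys/values from the OCR text prior.
    conf: (T,) OCR confidence in [0, 1] per text token."""
    d = img_feats.shape[-1]
    scores = img_feats @ text_feats.T / np.sqrt(d)    # (N, T) attention logits
    attn = softmax(scores, axis=-1)
    attn = attn * conf[None, :]                       # down-weight unreliable OCR tokens
    attn = attn / attn.sum(axis=-1, keepdims=True)    # renormalize rows to sum to 1
    return attn @ text_feats                          # (N, d) attended text features
```

A token the OCR engine is unsure about (low `conf`) contributes proportionally less to the fused features, so a misrecognized character cannot dominate the restoration.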