Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance

📅 2025-10-24
🤖 AI Summary
Existing generative super-resolution methods perform well on natural images but often distort character shapes when applied to text images, so they fail to preserve text readability and visual fidelity at the same time. To address this, the authors propose TIGER, a "text-first, image-later" two-stage framework: Stage I uses a structure-aware network to reconstruct glyphs with high geometric accuracy, and Stage II performs image super-resolution guided explicitly by the reconstructed glyphs. The key innovation is this explicit glyph-to-image guidance mechanism, which breaks the longstanding trade-off between readability and perceptual quality. To support extreme-scale text super-resolution (×14.29), the authors also introduce UltraZoom-ST, described as the first scene text dataset with extreme zoom. Experiments show that TIGER achieves state-of-the-art performance, improving character legibility while preserving overall image quality.

📝 Abstract
Current generative super-resolution methods show strong performance on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce TIGER (Text-Image Guided supEr-Resolution), a novel two-stage framework that breaks this trade-off through a "text-first, image-later" paradigm. TIGER explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and then uses them to guide subsequent full-image super-resolution. This glyph-to-image guidance ensures both high fidelity and visual consistency. To support comprehensive training and evaluation, we also contribute UltraZoom-ST (UltraZoom-Scene Text), the first scene text dataset with extreme zoom (×14.29). Extensive experiments show that TIGER achieves state-of-the-art performance, enhancing readability while preserving overall image quality.
Problem

Research questions and friction points this paper is trying to address.

Restores distorted text in super-resolution images
Decouples glyph reconstruction from image enhancement
Breaks trade-off between image quality and readability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework decouples glyph restoration from enhancement
Glyph structure guidance ensures high fidelity and consistency
Novel dataset supports extreme zoom text super-resolution training
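The "text-first, image-later" control flow described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`restore_glyphs`, `enhance_image`, `two_stage_sr`), the thresholding step, the nearest-neighbour resize, and the `strength` parameter are all hypothetical stand-ins chosen only to show how a glyph map produced in Stage I might be passed as guidance into Stage II, and how a non-integer factor such as ×14.29 maps input size to output size.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of values in [0, 1]


def upsample_nearest(img: Image, scale: float) -> Image:
    """Nearest-neighbour resize; handles non-integer factors such as 14.29."""
    in_h, in_w = len(img), len(img[0])
    out_h = max(1, round(in_h * scale))
    out_w = max(1, round(in_w * scale))
    return [
        [img[min(in_h - 1, int(y / scale))][min(in_w - 1, int(x / scale))]
         for x in range(out_w)]
        for y in range(out_h)
    ]


def restore_glyphs(lr: Image, scale: float, threshold: float = 0.5) -> Image:
    """Stage I stand-in (assumption): a binary glyph mask at target resolution.

    Dark pixels are treated as text. A real system would use a learned,
    structure-aware network here instead of a threshold.
    """
    up = upsample_nearest(lr, scale)
    return [[1.0 if v < threshold else 0.0 for v in row] for row in up]


def enhance_image(lr: Image, glyphs: Image, scale: float,
                  strength: float = 0.8) -> Image:
    """Stage II stand-in (assumption): upsample, then darken glyph pixels.

    The glyph mask acts as explicit guidance: where it is 1, the pixel is
    scaled by (1 - strength); elsewhere the upsampled value is kept.
    """
    up = upsample_nearest(lr, scale)
    return [
        [v * (1.0 - strength * g) for v, g in zip(row, glyph_row)]
        for row, glyph_row in zip(up, glyphs)
    ]


def two_stage_sr(lr: Image, scale: float) -> Image:
    """Run Stage I first, then feed its output into Stage II as guidance."""
    glyphs = restore_glyphs(lr, scale)
    return enhance_image(lr, glyphs, scale)


# Usage: a 2x2 input at x14.29 yields a 29x29 output (round(2 * 14.29) == 29).
lr = [[0.9, 0.2], [0.9, 0.9]]
sr = two_stage_sr(lr, 14.29)
print(len(sr), len(sr[0]))
```

The point of the sketch is the data flow: Stage II never has to infer glyph shapes on its own, because it receives them as a separate input. Swapping either stand-in for a learned model would leave `two_stage_sr` unchanged.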
Authors
Minxing Luo (Unknown affiliation)
Linlong Fan (vivo Mobile Communication Co. Ltd)
Wang Qiushi (SDS, The Chinese University of Hong Kong, Shenzhen)
Ge Wu (VCIP, CS, Nankai University)
Yiyan Luo (vivo Mobile Communication Co. Ltd)
Yuhang Yu (vivo Mobile Communication Co. Ltd)
Jinwei Chen (vivo). Interests: computer vision
Yaxing Wang (Associate professor, Nankai University). Interests: Deep learning, GANs, Image-to-image translation, Transfer learning
Qingnan Fan (Lead researcher @ VIVO | Prev Tencent, Stanford, SDU). Interests: Diffusion models, 3D Vision, Computer Graphics
Jian Yang (VCIP, CS, Nankai University)