Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited modeling capacity of Vision Transformers (ViTs) in single-image super-resolution (SISR), this paper proposes a two-stage ViT framework. In Stage I, a self-supervised pre-training scheme is introduced using image colorization as a proxy task to enhance the model’s general representation capability for texture and structural priors. In Stage II, a residual high-frequency image prediction mechanism is incorporated, coupled with residual upsampling, to simplify the super-resolution learning process. Departing from conventional supervised pre-training paradigms, this work is the first to deeply integrate colorization into a ViT-based SISR architecture. Evaluated on DIV2K, the method achieves 22.90 dB PSNR and 0.712 SSIM—significantly outperforming baseline ViT-based approaches. These results validate the effectiveness of self-supervised representation learning and residual high-frequency modeling in boosting ViT performance for SISR.

📝 Abstract
In computer vision, Single Image Super-Resolution (SISR) remains a difficult problem. We present ViT-SR, a technique that improves the performance of a Vision Transformer (ViT) through a two-stage training strategy. First, a self-supervised pretraining phase on a colorization task lets the model learn rich, generalizable visual representations from the data itself. The pretrained model is then fine-tuned for 4x super-resolution. The network predicts a high-frequency residual image that is added to an initial bicubic interpolation, which simplifies the learning problem to residual prediction. Trained and evaluated on the DIV2K benchmark dataset, ViT-SR achieves an SSIM of 0.712 and a PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pretraining for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.
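The residual formulation described in the abstract can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the ViT residual predictor is replaced by a zero placeholder, and nearest-neighbor upsampling stands in for bicubic interpolation (which a real pipeline would take from SciPy, PIL, or PyTorch).

```python
import numpy as np

def upsample_4x(lr: np.ndarray) -> np.ndarray:
    """Stand-in for 4x bicubic interpolation (nearest-neighbor for brevity).

    The paper uses bicubic upsampling; this placeholder only preserves the
    shape semantics of the pipeline.
    """
    return np.kron(lr, np.ones((4, 4, 1)))

def predict_residual(base: np.ndarray) -> np.ndarray:
    """Placeholder for the ViT's high-frequency residual prediction.

    In ViT-SR this would be the fine-tuned transformer's output; here it
    is zeros so the sketch stays self-contained.
    """
    return np.zeros_like(base)

def super_resolve(lr: np.ndarray) -> np.ndarray:
    """Residual upsampling: SR = coarse interpolation + predicted residual."""
    base = upsample_4x(lr)             # coarse, low-frequency estimate
    residual = predict_residual(base)  # high-frequency detail
    return np.clip(base + residual, 0.0, 1.0)

lr = np.random.rand(32, 32, 3).astype(np.float32)
sr = super_resolve(lr)
print(sr.shape)  # (128, 128, 3)
```

Because the network only has to learn the residual on top of an already reasonable interpolation, the target it regresses is sparse and zero-centered, which is the simplification the abstract refers to.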
Problem

Research questions and friction points this paper is trying to address.

Improves single-image super-resolution with a two-stage Vision Transformer.
Uses self-supervised colorization pretraining to learn generalizable visual representations.
Simplifies super-resolution learning by predicting high-frequency residuals over a bicubic upsample.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training with colorization pretraining
Residual upsampling via high-frequency residual prediction
Self-supervised learning for image restoration tasks
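The self-supervised aspect of the colorization pretext task is that training pairs are derived from the image itself, with no manual labels. A hedged sketch of the pair construction, using the standard Rec. 601 luma weights (details such as the exact color space are assumptions, not taken from the paper):

```python
import numpy as np

def make_colorization_pair(rgb: np.ndarray):
    """Build a self-supervised training pair from a single color image.

    Input: grayscale version (Rec. 601 luma); target: the original RGB.
    The image supervises itself, so any unlabeled photo collection works.
    """
    weights = np.array([0.299, 0.587, 0.114], dtype=rgb.dtype)
    gray = rgb @ weights               # (H, W) luma channel
    return gray[..., None], rgb        # input (H, W, 1), target (H, W, 3)

img = np.random.rand(64, 64, 3).astype(np.float32)
x, y = make_colorization_pair(img)
print(x.shape, y.shape)  # (64, 64, 1) (64, 64, 3)
```

Recovering plausible colors forces the model to reason about textures, object boundaries, and scene structure, which is why such representations transfer to restoration tasks like SISR.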
Aditya Chaudhary, LNMIIT, Jaipur, India
Prachet Dev Singh, LNMIIT, Jaipur, India
Ankit Jha, Researcher and Faculty, CSE, The LNMIIT Jaipur
Remote Sensing, Computer Vision, Machine Learning, VLMs, Prompt Learning