🤖 AI Summary
To address the limited modeling capacity of Vision Transformers (ViTs) in single-image super-resolution (SISR), this paper proposes a two-stage ViT framework. In Stage I, a self-supervised pre-training scheme uses image colorization as a proxy task to strengthen the model's general representations of texture and structural priors. In Stage II, the model predicts a high-frequency residual image that is added to a bicubic-upsampled input, simplifying the super-resolution learning task. Departing from conventional supervised pre-training paradigms, this work is the first to deeply integrate colorization into a ViT-based SISR architecture. Evaluated on DIV2K, the method achieves 22.90 dB PSNR and 0.712 SSIM, significantly outperforming baseline ViT-based approaches. These results validate the effectiveness of self-supervised representation learning and residual high-frequency modeling for ViT-based SISR.
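The quality numbers above are reported in PSNR and SSIM. PSNR in particular has a simple closed form, 10·log10(MAX²/MSE), which the following minimal sketch computes; the toy images here are illustrative only and are unrelated to the paper's DIV2K evaluation:

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((ref - test) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.ones((8, 8))
noisy = ref - 0.1          # uniform error of 0.1 -> MSE = 0.01
print(round(psnr(ref, noisy), 2))  # → 20.0
```

Higher PSNR means a smaller pixel-wise error relative to the signal's peak value; SSIM complements it by comparing local luminance, contrast, and structure rather than raw differences.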
📝 Abstract
In computer vision, Single Image Super-Resolution (SISR) remains a challenging problem. We present ViT-SR, a new technique that improves the performance of a Vision Transformer (ViT) through a two-stage training strategy. First, a self-supervised pre-training phase on an image colorization task lets the model learn rich, generalizable visual representations from the data itself. The pre-trained model is then fine-tuned for 4x super-resolution. Rather than regressing the full high-resolution image, the model predicts a high-frequency residual that is added to an initial bicubic interpolation, which simplifies the learning task. Trained and evaluated on the DIV2K benchmark dataset, ViT-SR achieves a PSNR of 22.90 dB and an SSIM of 0.712. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further gains may be possible with larger ViT architectures or alternative pretext tasks.
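The residual design described in the abstract can be sketched as follows. This is a hedged illustration, not the paper's implementation: `residual_predictor` is a placeholder for the fine-tuned ViT, and `scipy.ndimage.zoom` with `order=3` stands in for the bicubic upsampler.

```python
import numpy as np
from scipy.ndimage import zoom

def super_resolve(lr_image, residual_predictor, scale=4):
    """Residual SR: bicubic-upsample the input, then add a predicted
    high-frequency residual. `residual_predictor` is a stand-in for
    the paper's fine-tuned ViT (an assumption for this sketch)."""
    base = zoom(lr_image, (scale, scale), order=3)  # bicubic initial estimate
    residual = residual_predictor(base)             # high-frequency detail
    return np.clip(base + residual, 0.0, 1.0)

# Toy stand-in predictor: a zero residual (a real model would add detail).
lr = np.random.rand(16, 16)
sr = super_resolve(lr, lambda x: np.zeros_like(x))  # shape (64, 64)
```

Predicting only the residual means the network does not have to relearn the low-frequency content already captured by the bicubic estimate, which is the sense in which the design "simplifies residual learning."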