VOSR: A Vision-Only Generative Model for Image Super-Resolution

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing generative image super-resolution methods, which predominantly rely on text-to-image diffusion models and overlook the inherently visual nature of the restoration task. The authors propose VOSR, a purely vision-driven generative super-resolution framework that leverages a pretrained visual encoder to extract semantic and spatial information for guidance. A novel guidance mechanism tailored specifically for restoration replaces the conventional unconditional branch. The authors first train a multi-step diffusion model from scratch and then distill it into an efficient single-step generator. VOSR demonstrates, for the first time, that high-quality generative super-resolution can be achieved without multimodal pretraining, attaining superior or comparable perceptual quality, structural fidelity, and inference efficiency on both synthetic and real-world datasets, while requiring less than one-tenth of the training cost of typical text-to-image approaches.
📝 Abstract
Most recent generative image super-resolution (SR) methods rely on adapting large text-to-image (T2I) diffusion models pretrained on web-scale text-image data. While effective, this paradigm starts from a generic T2I generator, even though SR is fundamentally a low-resolution (LR) input-conditioned image restoration task. In this work, we investigate whether an SR model trained purely on visual data can rival T2I-based ones. To this end, we propose VOSR, a Vision-Only generative framework for SR. We first extract semantically rich and spatially grounded features from the LR input using a pretrained vision encoder as visual semantic guidance. We then revisit classifier-free guidance for training generative models and show that the standard unconditional branch is ill-suited to restoration models trained from scratch. We therefore replace it with a restoration-oriented guidance strategy that preserves weak LR anchors. Built upon these designs, we first train a multi-step VOSR model from scratch and then distill it into a one-step model for efficient inference. VOSR requires less than one-tenth of the training cost of representative T2I-based SR methods, yet in both multi-step and one-step settings, it achieves competitive or even better perceptual quality and efficiency, while producing more faithful structures with fewer hallucinations on both synthetic and real-world benchmarks. Our results show, for the first time, that high-quality generative SR can be achieved without multimodal pretraining. The code and models can be found at https://github.com/cswry/VOSR.
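The abstract's discussion of classifier-free guidance (CFG) can be sketched in a few lines. Standard CFG extrapolates from an unconditional prediction toward a conditional one; the abstract says VOSR replaces the unconditional branch with one that "preserves weak LR anchors". The first function below is standard CFG; the second is a hypothetical illustration of that replacement (the exact formulation, and the names `eps_cond` / `eps_weak_lr`, are assumptions, not taken from the paper).

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    # Standard classifier-free guidance: extrapolate from the
    # unconditional prediction toward the conditional one.
    # w = 1 recovers the conditional branch; w = 0 the unconditional one.
    return eps_uncond + w * (eps_cond - eps_uncond)

def restoration_guidance(eps_cond, eps_weak_lr, w):
    # Hypothetical sketch of a restoration-oriented variant: the
    # unconditional branch is swapped for a prediction conditioned
    # only on a weak LR anchor, so guidance extrapolates away from
    # the LR-consistent baseline rather than from pure noise.
    return eps_weak_lr + w * (eps_cond - eps_weak_lr)
```

Both combiners are linear, so the guided prediction stays on the line between the two branch outputs; the difference lies only in which baseline anchors the extrapolation.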
Problem

Research questions and friction points this paper is trying to address.

image super-resolution
generative model
vision-only
diffusion model
image restoration
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-only
image super-resolution
diffusion model
classifier-free guidance
model distillation