One-Step Diffusion-based Real-World Image Super-Resolution with Visual Perception Distillation

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion-based super-resolution methods use knowledge distillation to accelerate inference but suffer from insufficient semantic alignment, which results in suboptimal CLIPIQA scores and compromised perceptual quality and semantic fidelity. To address this, we propose VPD-SR, a one-step diffusion super-resolution framework that avoids the efficiency bottleneck of multi-step sampling. We introduce a visual-perception-aware distillation paradigm that combines Explicit Semantic-aware Supervision (ESS) with a High-Frequency Perception (HFP) loss, enabling, for the first time, joint optimization of semantic consistency and texture realism. The method integrates CLIP-based semantic embeddings, adversarial training, and a one-step sampling mechanism. Extensive experiments on both synthetic and real-world datasets show that VPD-SR outperforms state-of-the-art methods and even the teacher model, improving CLIPIQA and overall perceptual quality with only a single denoising step.

📝 Abstract
Diffusion-based models are widely used in visual generation tasks and show promising results in image super-resolution (SR), but they typically require dozens or even hundreds of sampling steps. Existing methods use knowledge distillation to accelerate multi-step diffusion-based SR, yet their outputs exhibit insufficient semantic alignment with real images, which leads to suboptimal perceptual quality, as reflected in the CLIPIQA score. Perceptual quality and semantic fidelity therefore remain open challenges. To address them, we propose VPD-SR, a novel visual perception diffusion distillation framework designed specifically for SR, which aims to build an effective and efficient one-step SR model. VPD-SR consists of two components: Explicit Semantic-aware Supervision (ESS) and a High-Frequency Perception (HFP) loss. First, ESS leverages the visual perceptual understanding of the CLIP model to extract explicit semantic supervision, thereby enhancing semantic consistency. Second, because high-frequency information contributes to the perceived visual quality of an image, the HFP loss, applied in addition to the vanilla distillation loss, guides the student model to restore the high-frequency details that are missing from degraded images and are critical to perceptual quality. Finally, we extend VPD-SR with adversarial training to further enhance the authenticity of the generated content. Extensive experiments on synthetic and real-world datasets demonstrate that VPD-SR, with just one sampling step, outperforms both previous state-of-the-art methods and the teacher model.
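
To make the idea of a high-frequency loss concrete, here is a minimal sketch in plain Python. This summary does not say how the HFP loss is actually computed, so everything below is an assumption for illustration: the 3x3 Laplacian high-pass filter, the L1 distance, and the names `high_pass` and `high_frequency_l1` are hypothetical and are not taken from the paper. A real implementation would most likely operate on image tensors in a deep learning framework.

```python
# Illustrative sketch only: the filter, the distance, and all names are assumptions.
# The paper's actual HFP loss may be defined differently.

# A standard 3x3 Laplacian kernel, used here as a simple high-pass filter.
LAPLACIAN = [
    [0.0, -1.0, 0.0],
    [-1.0, 4.0, -1.0],
    [0.0, -1.0, 0.0],
]


def high_pass(image):
    """Apply the Laplacian to the interior pixels of a 2D grayscale image."""
    height, width = len(image), len(image[0])
    filtered = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            acc = 0.0
            for dy in range(3):
                for dx in range(3):
                    acc += LAPLACIAN[dy][dx] * image[y + dy - 1][x + dx - 1]
            row.append(acc)
        filtered.append(row)
    return filtered


def high_frequency_l1(student, reference):
    """Mean absolute difference between the high-pass responses of two images."""
    hf_student = high_pass(student)
    hf_reference = high_pass(reference)
    total, count = 0.0, 0
    for row_s, row_r in zip(hf_student, hf_reference):
        for a, b in zip(row_s, row_r):
            total += abs(a - b)
            count += 1
    return total / count


# Toy data: a checkerboard (rich in high frequencies) and a flat gray image
# standing in for an over-smoothed student output.
reference = [[float((x + y) % 2) for x in range(6)] for y in range(6)]
blurry = [[0.5 for _ in range(6)] for _ in range(6)]

loss_identical = high_frequency_l1(reference, reference)
loss_blurry = high_frequency_l1(blurry, reference)
```

In this toy case the flat image differs from the checkerboard by only 0.5 per pixel, but its high-pass response is zero everywhere, so the high-frequency term penalizes the missing texture much more strongly than a plain pixel-wise L1 would.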

Problem

Research questions and friction points this paper is trying to address.

Accelerate diffusion-based SR with one-step sampling
Enhance semantic alignment with real images
Improve perceptual quality via high-frequency details

Innovation

Methods, ideas, or system contributions that make the work stand out.

One-step diffusion model for super-resolution
CLIP-based semantic supervision for alignment
High-frequency perception loss for details
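
The CLIP-based semantic supervision listed above can be pictured as a distance between image embeddings. The sketch below is a hedged illustration, not the paper's formulation: it assumes the supervision is a cosine distance between the embedding of the student output and the embedding of a reference image, and it uses short placeholder vectors instead of calling a real CLIP encoder. The function names and the toy vectors are hypothetical.

```python
import math

# Illustrative sketch only: the cosine-distance form and all names are assumptions.
# Real CLIP embeddings would come from a pretrained image encoder, which is not
# invoked here; the vectors below are placeholders.


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_alignment_loss(student_embedding, reference_embedding):
    """1 - cosine similarity: 0 when the embeddings point the same way."""
    return 1.0 - cosine_similarity(student_embedding, reference_embedding)


# Placeholder embeddings standing in for encoder outputs.
reference_embedding = [1.0, 0.0, 0.0]
aligned_embedding = [2.0, 0.0, 0.0]     # same direction, different magnitude
unrelated_embedding = [0.0, 3.0, 0.0]   # orthogonal direction

loss_aligned = semantic_alignment_loss(aligned_embedding, reference_embedding)
loss_unrelated = semantic_alignment_loss(unrelated_embedding, reference_embedding)
```

Cosine distance ignores embedding magnitude, so the aligned vector incurs no penalty even though it is twice as long, while the orthogonal vector receives a loss of 1.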

Xue Wu
State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi’an 710071, Shaanxi, China

Jingwei Xin
Xidian University
Machine Learning, Computer Vision

Zhijun Tu
Huawei Noah's Ark Lab
Efficient LLM and AIGC system, Model Compression

Jie Hu
Huawei Noah’s Ark Lab, Beijing 100084, China

Jie Li
State Key Laboratory of Integrated Services Networks, School of Electronic Engineering, Xidian University, Xi’an 710071, Shaanxi, China

Nannan Wang
Professor, Xidian University
Computer Vision, Machine Learning, Pattern Recognition

Xinbo Gao
Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing 400065, China