Visual Autoregressive Modeling for Image Super-Resolution

πŸ“… 2025-01-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Addressing the longstanding trade-off among fidelity, perceptual realism, and inference efficiency in image super-resolution (ISR), this paper introduces VARSR, the first ISR framework to incorporate visual autoregressive modeling. The method employs a prefix-conditioned, multi-scale autoregressive architecture that generates high-resolution details progressively across scales. Key innovations include: (1) a scale-aligned rotary position embedding (RoPE) to preserve spatial coherence across resolutions; (2) image-based classifier-free guidance for enhanced perceptual quality without class labels; and (3) a diffusion-based refiner that models the quantization residual to achieve pixel-level fidelity. On benchmark multi-scale super-resolution tasks, VARSR achieves state-of-the-art fidelity (PSNR/SSIM) and perceptual quality (LPIPS/FID) simultaneously. Moreover, it accelerates inference by 3–5× over leading diffusion-based ISR models, significantly advancing the frontier of efficient, high-fidelity super-resolution.
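The prefix-conditioned, coarse-to-fine generation loop described in the summary can be sketched roughly as follows. This is a minimal illustration only: `predict_fn`, the scale schedule, and the flat token representation are assumptions for the sketch, not the paper's actual interfaces.

```python
def next_scale_generation(prefix_tokens, scales, predict_fn):
    """Sketch of prefix-conditioned next-scale prediction.

    The low-resolution condition (assumed already tokenized) is prepended
    as prefix tokens; token maps are then generated coarse-to-fine, one
    full scale per autoregressive step, each scale conditioned on the
    prefix plus all previously generated scales.
    """
    context = list(prefix_tokens)   # LR condition as prefix tokens
    per_scale = []
    for h, w in scales:             # e.g. [(1, 1), (2, 2), (4, 4), ...]
        # predict all h*w tokens of this scale in one step (hypothetical model call)
        tokens = predict_fn(context, (h, w))
        per_scale.append(tokens)
        context.extend(tokens)      # the next scale attends to everything so far
    return per_scale
```

The point of the sketch is the control flow: unlike next-token prediction, each autoregressive step emits an entire token map for one resolution, which is where the efficiency gain over per-token or per-denoising-step generation comes from.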

πŸ“ Abstract
Image Super-Resolution (ISR) has seen significant progress with the introduction of remarkable generative models. However, challenges such as the trade-off between fidelity and realism, as well as computational complexity, have limited their application. Building upon the tremendous success of autoregressive models in the language domain, we propose VARSR, a novel visual autoregressive modeling framework for ISR in the form of next-scale prediction. To effectively integrate and preserve semantic information from low-resolution images, we propose using prefix tokens to incorporate the condition. Scale-aligned Rotary Positional Encodings are introduced to capture spatial structures, and a diffusion refiner is utilized to model the quantization residual and achieve pixel-level fidelity. Image-based Classifier-free Guidance is proposed to guide the generation of more realistic images. Furthermore, we collect large-scale data and design a training process to obtain robust generative priors. Quantitative and qualitative results show that VARSR generates high-fidelity and high-realism images more efficiently than diffusion-based methods. Our codes will be released at https://github.com/qyp2000/VARSR.
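The Image-based Classifier-free Guidance mentioned in the abstract presumably combines conditional and unconditional predictions in the usual CFG form. The sketch below shows only that standard combination; applying it at the logit level and the choice of guidance scale are assumptions, not details taken from the paper.

```python
import numpy as np

def cfg_combine(cond_logits, uncond_logits, guidance_scale):
    """Standard classifier-free guidance combination.

    Extrapolates from the unconditional prediction toward the conditional
    one: guidance_scale = 1 recovers the conditional logits exactly, and
    larger values push generation further toward the (image) condition.
    """
    cond = np.asarray(cond_logits, dtype=float)
    uncond = np.asarray(uncond_logits, dtype=float)
    return uncond + guidance_scale * (cond - uncond)
```

Here the "class label" of conventional CFG is replaced by the low-resolution image itself as the condition, which is what makes the guidance applicable without any label supervision.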
Problem

Research questions and friction points this paper is trying to address.

Image Super-Resolution
Clarity Enhancement
Computational Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

VARSR
Image Super-Resolution
Fast Computation
πŸ”Ž Similar Papers
No similar papers found.
Authors

Yunpeng Qu (Tsinghua University)
Kun Yuan (Kuaishou Technology, Beijing, China)
Jinhua Hao (Kuaishou Technology; Computer Vision, Generative AI, Fluid Mechanics)
Kai Zhao (Kuaishou Technology, Beijing, China)
Qizhi Xie (Tsinghua University, Beijing, China; Kuaishou Technology, Beijing, China)
Ming Sun (Kuaishou Technology, Beijing, China)
Chao Zhou (Kuaishou Technology, Beijing, China)