NSARM: Next-Scale Autoregressive Modeling for Robust Real-World Image Super-Resolution

📅 2025-10-01
🤖 AI Summary
Existing Real-ISR methods rely on pre-trained text-to-image diffusion models and face a trade-off between efficiency and reconstruction quality: generation from random noise yields realistic results but is slow because of iterative denoising, whereas generation directly from the low-quality input is fast but lowers output quality. Moreover, these methods train only ControlNet or LoRA modules on a frozen backbone, which often introduces over-enhanced artifacts and hallucinations and limits robustness across diverse degradations. To address these limitations, we propose NSARM, a Real-ISR framework built on **next-scale autoregressive modeling**. Starting from the pre-trained visual autoregressive model Infinity, which uses bitwise next-scale prediction, NSARM adds a transformation network that maps the low-quality input to preliminary scales and is trained in two stages: the transformation network first, then end-to-end fine-tuning of the full model. As a pure autoregressive model, NSARM avoids iterative diffusion sampling and keeps inference fast. Quantitative and qualitative evaluations show better visual results than existing Real-ISR methods and much higher robustness to input quality.

📝 Abstract
Most recent real-world image super-resolution (Real-ISR) methods employ pre-trained text-to-image (T2I) diffusion models to synthesize the high-quality image either from random Gaussian noise, which yields realistic results but is slow due to iterative denoising, or directly from the input low-quality image, which is efficient but comes at the cost of lower output quality. These approaches train ControlNet or LoRA modules while keeping the pre-trained model fixed, which often introduces over-enhanced artifacts and hallucinations and reduces robustness to inputs with varying degradations. Recent visual autoregressive (AR) models, such as the pre-trained Infinity, provide strong T2I generation capabilities while offering superior efficiency through a bitwise next-scale prediction strategy. Building upon next-scale prediction, we introduce a robust Real-ISR framework, namely Next-Scale Autoregressive Modeling (NSARM). Specifically, we train NSARM in two stages: a transformation network is first trained to map the input low-quality image to preliminary scales, followed by end-to-end fine-tuning of the full model. Such comprehensive fine-tuning enhances the robustness of NSARM in Real-ISR tasks without compromising its generative capability. Extensive quantitative and qualitative evaluations demonstrate that, as a pure AR model, NSARM achieves superior visual results over existing Real-ISR methods while maintaining a fast inference speed. Most importantly, it is much more robust to the quality of input images and generalizes better. Project page: https://github.com/Xiangtaokong/NSARM
Problem

Research questions and friction points this paper is trying to address.

Enhancing real-world image super-resolution robustness and quality
Overcoming over-enhancement artifacts in diffusion-based super-resolution methods
Improving generalization across varying image degradation types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses autoregressive next-scale prediction for super-resolution
Trains transformation network then full-model fine-tuning
Enhances robustness without compromising generative capability
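
The idea of representing an image as a sequence of scales, coarse to fine, can be illustrated with a toy residual decomposition. The sketch below is not the NSARM or Infinity implementation: the function names, the average-pool/nearest-neighbour resampling, and the 1-bit sign quantizer (standing in for a bitwise tokenizer) are all assumptions chosen for readability. It only shows how each scale encodes what the coarser scales have not yet explained; in the paper's setting, the preliminary scales would come from the low-quality input via the transformation network and the remaining ones from autoregressive prediction.

```python
# Toy coarse-to-fine residual decomposition (illustrative only; hypothetical names).

def downsample(img, size):
    """Average-pool a square 2D list to size x size (size must divide len(img))."""
    k = len(img) // size
    return [
        [
            sum(img[r * k + i][c * k + j] for i in range(k) for j in range(k)) / (k * k)
            for c in range(size)
        ]
        for r in range(size)
    ]


def upsample(img, size):
    """Nearest-neighbour upsample a square 2D list to size x size."""
    k = size // len(img)
    return [[img[r // k][c // k] for c in range(size)] for r in range(size)]


def quantize_bits(img, step):
    """Assumed 1-bit quantizer: keep only the sign of each value, scaled by step."""
    return [[step if v >= 0 else -step for v in row] for row in img]


def encode_scales(image, scales, steps):
    """Return one quantized map per scale plus the running reconstruction."""
    n = len(image)
    recon = [[0.0] * n for _ in range(n)]
    bit_maps = []
    for size, step in zip(scales, steps):
        # Each scale only sees what the coarser scales failed to explain.
        residual = [[image[r][c] - recon[r][c] for c in range(n)] for r in range(n)]
        bits = quantize_bits(downsample(residual, size), step)
        bit_maps.append(bits)
        up = upsample(bits, n)
        recon = [[recon[r][c] + up[r][c] for c in range(n)] for r in range(n)]
    return bit_maps, recon


image = [[0.3] * 4 for _ in range(4)]
bit_maps, recon = encode_scales(image, scales=[1, 2, 4], steps=[0.5, 0.25, 0.125])
print([len(m) for m in bit_maps])  # prints [1, 2, 4]
```

Halving the step at each scale is a design choice of this sketch: for a constant image with values in [-1, 1] it halves the worst-case residual at every scale, which makes the coarse-to-fine refinement easy to see.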