Multi-scale Image Super Resolution with a Single Auto-Regressive Model

📅 2025-06-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing VARSR methods suffer from fixed-resolution processing, reliance on prohibitively large models (e.g., 1B parameters), and dependence on private datasets. This paper introduces the first multi-scale autoregressive super-resolution framework capable of unified half- and full-scale upsampling inference. Our approach addresses these limitations through three key contributions: (1) a multi-scale hierarchical image tokenization scheme that ensures cross-scale semantic consistency for residual modeling; (2) the first integration of Direct Preference Optimization (DPO) into visual autoregressive training, significantly enhancing perceptual quality and fidelity; and (3) a compact 300M-parameter VAR architecture achieving state-of-the-art performance without external data, reducing parameter count by two thirds relative to VAR-d24 while maintaining superior reconstruction accuracy and detail synthesis.

๐Ÿ“ Abstract
In this paper we tackle Image Super Resolution (ISR), using recent advances in Visual Auto-Regressive (VAR) modeling. VAR iteratively estimates the residual in latent space between gradually increasing image scales, a process referred to as next-scale prediction. Thus, the strong priors learned during pre-training align well with the downstream task (ISR). To our knowledge, only VARSR has exploited this synergy so far, showing promising results. However, due to the limitations of existing residual quantizers, VARSR works only at a fixed resolution, i.e. it fails to map intermediate outputs to the corresponding image scales. Additionally, it relies on a 1B transformer architecture (VAR-d24), and leverages a large-scale private dataset to achieve state-of-the-art results. We address these limitations through two novel components: a) a Hierarchical Image Tokenization approach with a multi-scale image tokenizer that progressively represents images at different scales while simultaneously enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the LR and HR tokenizations, encourages the transformer to produce the latter over the former. To the best of our knowledge, this is the first time a quantizer is trained to force semantically consistent residuals at different scales, and the first time that preference-based optimization is used to train a VAR. Using these two components, our model can denoise the LR image and super-resolve at half and full target upscale factors in a single forward pass. Additionally, we achieve *state-of-the-art results on ISR*, while using a small model (300M params vs ~1B params of VARSR), and without using external training data.
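The next-scale prediction idea in the abstract, iteratively quantizing the residual between gradually increasing image scales, can be sketched as follows. This is a minimal illustrative simplification, not the paper's tokenizer: the function and variable names (`nearest_code`, `next_scale_residual_encode`) are hypothetical, and a real multi-scale quantizer uses learned codebooks and convolutional up/downsampling rather than average pooling and nearest-neighbour upsampling.

```python
import numpy as np

def nearest_code(x, codebook):
    """Vector quantization: map each latent vector to its nearest codebook entry."""
    # x: (n, d), codebook: (K, d)
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d2.argmin(1)]

def next_scale_residual_encode(latent, codebook, scales):
    """Encode a latent map as a sum of quantized residuals at increasing scales.

    latent: (H, W, d) latent feature map; scales: side lengths, e.g. [1, 2, 4, 8]
    (each must divide H and W). At each scale, the residual that previous scales
    failed to explain is downsampled, quantized, upsampled back, and accumulated.
    """
    H, W, d = latent.shape
    approx = np.zeros_like(latent)
    tokens = []
    for s in scales:
        residual = latent - approx
        # average-pool the residual down to an s x s grid
        small = residual.reshape(s, H // s, s, W // s, d).mean(axis=(1, 3))
        q = nearest_code(small.reshape(-1, d), codebook).reshape(s, s, d)
        tokens.append(q)
        # nearest-neighbour upsample back to full resolution and accumulate
        up = np.repeat(np.repeat(q, H // s, axis=0), W // s, axis=1)
        approx = approx + up
    return tokens, approx
```

Each intermediate `approx` corresponds to one image scale; the cross-scale consistency the paper enforces is precisely what makes these intermediate sums decodable into valid half-scale images, which a fixed-resolution quantizer like VARSR's does not guarantee.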
Problem

Research questions and friction points this paper is trying to address.

Overcoming fixed-resolution limitation in VAR-based super-resolution models
Enhancing multi-scale image tokenization for consistent residual learning
Achieving state-of-the-art ISR with smaller models and no external data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Image Tokenization for multi-scale representation
Direct Preference Optimization for regularization
Single forward pass for denoising and super-resolution
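The DPO regularization term described above can be sketched as a preference loss over the transformer's token logits, treating the HR tokenization as the preferred sequence and the LR tokenization as the rejected one. This is the standard DPO objective (a log-sigmoid of a reference-relative log-probability margin); the function names and the exact way the paper plugs it into VAR training are assumptions.

```python
import numpy as np

def log_prob_of_sequence(logits, tokens):
    """Sum of log-probabilities the model assigns to a token sequence."""
    # logits: (T, V) unnormalized scores; tokens: (T,) integer token ids
    logZ = np.log(np.exp(logits).sum(axis=1))
    return (logits[np.arange(len(tokens)), tokens] - logZ).sum()

def dpo_loss(policy_logits, ref_logits, hr_tokens, lr_tokens, beta=0.1):
    """DPO-style preference loss: push the policy to prefer the HR tokenization
    over the LR one, measured relative to a frozen reference model."""
    pi_hr = log_prob_of_sequence(policy_logits, hr_tokens)
    pi_lr = log_prob_of_sequence(policy_logits, lr_tokens)
    ref_hr = log_prob_of_sequence(ref_logits, hr_tokens)
    ref_lr = log_prob_of_sequence(ref_logits, lr_tokens)
    margin = beta * ((pi_hr - ref_hr) - (pi_lr - ref_lr))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)
```

Note that, as the abstract states, this term needs only the LR and HR tokenizations as the rejected/preferred pair, so no extra human preference labels are required.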