Hierarchical Image Tokenization for Multi-Scale Image Super Resolution

📅 2026-05-14

🤖 AI Summary

This work proposes a multi-scale image super-resolution method based on Visual Auto-Regressive (VAR) modeling. It targets two limitations of existing VAR-based approaches: they typically produce outputs at a single fixed scale, and they rely on large backbones or large annotated datasets. The proposed framework introduces Hierarchical Image Tokenization (HIT), which represents an image progressively at different scales while enforcing token overlap across scales, and a Direct Preference Optimization (DPO) regularization term that relies solely on the (LR, HR) pair to encourage the transformer to produce the high-resolution output over the low-resolution one. With this design the model delivers high-resolution outputs at multiple scales in a single forward pass. With only 300M parameters, compared with 1B for VARSR, the method achieves state-of-the-art results without external training data, increasing flexibility while reducing model complexity.
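
As a rough illustration of the preference term described above, the sketch below implements the standard DPO objective for a single (LR, HR) pair, treating the HR token sequence as the preferred output and the LR token sequence as the rejected one. The function and argument names, the use of a frozen reference model, and the value beta = 0.1 are assumptions made for illustration; the paper's exact regularizer and how it is weighted against the main training loss are not stated here.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_hr: float,  # log-prob of the HR tokens under the trained model
    policy_logp_lr: float,  # log-prob of the LR tokens under the trained model
    ref_logp_hr: float,     # same quantities under a frozen reference model (assumed)
    ref_logp_lr: float,
    beta: float = 0.1,      # hypothetical temperature, not taken from the paper
) -> float:
    """Standard DPO loss with HR as the preferred and LR as the rejected output."""
    margin = (policy_logp_hr - ref_logp_hr) - (policy_logp_lr - ref_logp_lr)
    return -log_sigmoid(beta * margin)
```

The loss equals log 2 when the model and the reference agree, and it falls as the model assigns relatively more probability to the HR tokens than to the LR tokens.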

📝 Abstract

We introduce a multi-scale Image Super Resolution (ISR) method building on recent advances in Visual Auto-Regressive (VAR) modeling. VAR models break image tokenization into additive, gradually increasing scales, using Residual Quantization (RQ), an approach that aligns perfectly with our target ISR task. Previous works taking advantage of this synergy suffer from two main shortcomings. First, due to the limitations in RQ, they only generate images at a predefined fixed scale, failing to map intermediate outputs to the corresponding image scales. They also rely on large backbones or a large corpus of annotated data to achieve better performance. To address both shortcomings, we introduce two novel components to the VAR training for ISR, aiming at increasing its flexibility and reducing its complexity. In particular, we introduce a) a Hierarchical Image Tokenization (HIT) approach that progressively represents images at different scales while enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the (LR, HR) pair, encourages the transformer to produce the latter over the former. Our proposed HIT acts as a strong inductive bias for the VAR training, resulting in a small model (300M params vs 1B params of VARSR), that achieves state-of-the-art results without external training data, and that delivers multi-scale outputs with a single forward pass.
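
To make the "additive, gradually increasing scales" idea concrete, here is a toy 1-D sketch of multi-scale residual quantization in the spirit of VAR tokenizers. It is not the paper's HIT tokenizer. All function names, the scalar codebook, the average-pooling downsampler, and the nearest-neighbor upsampler are assumptions chosen to keep the example small; real tokenizers operate on 2-D latent feature maps with learned vector codebooks.

```python
def downsample(x, size):
    """Average-pool x into `size` bins (assumes len(x) is divisible by size)."""
    step = len(x) // size
    return [sum(x[i * step:(i + 1) * step]) / step for i in range(size)]


def upsample(x, size):
    """Nearest-neighbor upsampling of x to length `size`."""
    rep = size // len(x)
    return [v for v in x for _ in range(rep)]


def quantize(x, codebook):
    """Index of the nearest codebook entry for each element of x."""
    return [min(range(len(codebook)), key=lambda k: abs(codebook[k] - v)) for v in x]


def multiscale_rq(signal, scales, codebook):
    """Tokenize `signal` coarse-to-fine; each scale encodes the remaining residual."""
    residual = list(signal)
    recon = [0.0] * len(signal)
    tokens = []
    for s in scales:
        idx = quantize(downsample(residual, s), codebook)
        contrib = upsample([codebook[k] for k in idx], len(signal))
        recon = [r + c for r, c in zip(recon, contrib)]        # additive reconstruction
        residual = [r - c for r, c in zip(residual, contrib)]  # what is left to encode
        tokens.append(idx)
    return tokens, recon
```

In this sketch each scale's tokens encode only a residual, so summing the first few scales gives a coarse approximation at full length rather than a clean signal at that scale. This appears to be the kind of fixed-scale limitation the abstract attributes to RQ.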

Problem

Research questions and friction points this paper is trying to address.

Image Super Resolution

Multi-Scale

Visual Auto-Regressive

Residual Quantization

Hierarchical Tokenization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Image Tokenization

Direct Preference Optimization

Multi-Scale Super Resolution