Hierarchical Image Tokenization for Multi-Scale Image Super Resolution

📅 2026-05-14
📈 Citations: 0
Influential: 0
📄 PDF

career value

172K/year
🤖 AI Summary
This work proposes a multi-scale image super-resolution method based on visual autoregressive (VAR) modeling, addressing the limitations of existing approaches that typically support only fixed-scale outputs and rely on large models or extensive annotated data. The proposed framework introduces Hierarchical Image Tokens (HIT) to enable cross-scale semantic alignment and incorporates a Direct Preference Optimization (DPO) regularizer trained solely on low–high resolution image pairs. This design allows the model to generate high-resolution outputs at multiple scales in a single forward pass. Despite using only 300 million parameters—a relatively small scale for such tasks—the method achieves state-of-the-art performance without requiring external training data, significantly enhancing flexibility while reducing computational complexity.
📝 Abstract
We introduce a multi-scale Image Super Resolution (ISR) method building on recent advances in Visual Auto-Regressive (VAR) modeling. VAR models break image tokenization into additive, gradually increasing scales, using Residual Quantization (RQ), an approach that aligns perfectly with our target ISR task. Previous works taking advantage of this synergy suffer from two main shortcomings. First, due to the limitations in RQ, they only generate images at a predefined fixed scale, failing to map intermediate outputs to the corresponding image scales. They also rely on large backbones or a large corpus of annotated data to achieve better performance. To address both shortcomings, we introduce two novel components to the VAR training for ISR, aiming at increasing its flexibility and reducing its complexity. In particular, we introduce a) a \textbf{Hierarchical Image Tokenization (HIT)} approach that progressively represents images at different scales while enforcing token overlap across scales, and b) a \textbf{Direct Preference Optimization (DPO) regularization term} that, relying solely on the (LR,HR) pair, encourages the transformer to produce the latter over the former. Our proposed HIT acts as a strong inductive bias for the VAR training, resulting in a small model (300M params vs 1B params of VARSR), that achieves state-of-the-art results without external training data, and that delivers multi-scale outputs with a single forward pass.
Problem

Research questions and friction points this paper is trying to address.

Image Super Resolution
Multi-Scale
Visual Auto-Regressive
Residual Quantization
Hierarchical Tokenization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Image Tokenization
Direct Preference Optimization
Multi-Scale Super Resolution
Visual Auto-Regressive Modeling
Residual Quantization
🔎 Similar Papers
No similar papers found.