Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

📅 2025-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing single-image super-resolution (SISR) models generalize poorly beyond the scale factors they were trained on, and their output degrades sharply at extreme magnifications (e.g., ×256). To address this, we propose a chain-based progressive scaling framework that reaches extreme magnification without additional training: a single SR backbone is reused scale-autoregressively, decomposing large-factor upsampling into a sequence of controllable intermediate zoom steps. We further introduce multi-scale-aware text prompts that keep semantics consistent across scales, combining diffusion-based super-resolution, vision-language model (VLM)-driven prompt generation, and GRPO-based reinforcement fine-tuning that aligns the prompts with human preference. Using only a standard ×4 diffusion SR model, the approach achieves high-fidelity reconstruction at ×256 magnification and beyond, reportedly outperforming state-of-the-art methods in perceptual quality, geometric fidelity, and texture accuracy.
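The backbone-reuse idea can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper's code: `nearest_upsample` stands in for a ×4 diffusion SR model (it ignores the prompt), `describe` stands in for the VLM prompt extractor, and `chain_of_zoom` is simply a loop that reapplies the same step until the target magnification is reached.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # toy grayscale image as nested lists


def nearest_upsample(image: Image, prompt: str, factor: int) -> Image:
    """Hypothetical stand-in for an SR backbone: nearest-neighbour upsampling.

    A real diffusion SR model would condition on `prompt`; this toy ignores it.
    """
    out: Image = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def describe(image: Image, scale: int) -> str:
    """Hypothetical stand-in for a VLM prompt extractor."""
    return f"content seen at x{scale}, {len(image)}x{len(image[0])} pixels"


def chain_of_zoom(
    image: Image,
    sr_step: Callable[[Image, str, int], Image],
    prompt_fn: Callable[[Image, int], str],
    target_scale: int,
    step_scale: int = 4,
) -> Tuple[Image, List[str]]:
    """Reapply one fixed-scale SR step until `target_scale` is reached."""
    if target_scale < 1 or step_scale < 2:
        raise ValueError("target_scale must be >= 1 and step_scale >= 2")
    scale = 1
    prompts: List[str] = []
    while scale < target_scale:
        prompt = prompt_fn(image, scale)            # fresh prompt at every scale
        image = sr_step(image, prompt, step_scale)  # same backbone every step
        prompts.append(prompt)
        scale *= step_scale
    return image, prompts


if __name__ == "__main__":
    tiny = [[0.0, 1.0], [1.0, 0.0]]
    zoomed, used_prompts = chain_of_zoom(tiny, nearest_upsample, describe, 256)
    print(len(used_prompts), len(zoomed), len(zoomed[0]))
```

With a ×4 step, a ×256 target takes four passes through the same model. The sketch grows the whole image for simplicity; a practical pipeline would presumably zoom into a region of interest at each step to keep memory bounded.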

📝 Abstract

Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly reuses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Group Relative Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/
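One way to write the kind of factorization the abstract describes is given below. This is a simplified sketch under assumptions, not the paper's exact formulation: the notation is introduced here for illustration, with x_0 the low-resolution input, x_i the image after i zoom steps, c_i the text prompt used at step i, s the per-step scale factor, and each state assumed to depend only on the previous state and its prompt.

```latex
% Assumed notation and a first-order (Markov) simplification, for illustration only.
\[
  p(x_1, \dots, x_n \mid x_0, c_1, \dots, c_n)
    \;=\; \prod_{i=1}^{n} p(x_i \mid x_{i-1}, c_i),
  \qquad
  S_{\mathrm{total}} = s^{\,n}, \quad 4^{4} = 256 .
\]
```

Each factor is a single-step SR problem at the scale the backbone was trained on, which is why no retraining is needed; with s = 4, four steps already give ×256.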

Problem

Research questions and friction points this paper is trying to address.

Overcoming collapse of SISR models beyond trained scale factors

Achieving extreme super-resolution via autoregressive scale decomposition

Aligning multi-scale text prompts with human preference for guidance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive chain for extreme super-resolution

Multi-scale-aware text prompts compensate for diminishing visual cues

GRPO fine-tuning aligns text with human preference
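The group-relative scoring at the heart of GRPO-style fine-tuning can be sketched as follows. The setup is an assumption for illustration: several candidate prompts are sampled for the same image, a critic assigns each a scalar reward, and each reward is normalized against its own group. The function name, the reward values, and the use of the population standard deviation are choices made here, not details from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group.

    Candidates scoring above the group average get a positive advantage,
    those below get a negative one; `eps` avoids division by zero when
    every candidate receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical critic scores for four candidate prompts of one image.
    critic_scores = [0.2, 0.9, 0.5, 0.4]
    for score, adv in zip(critic_scores, group_relative_advantages(critic_scores)):
        print(f"reward={score:.1f} advantage={adv:+.3f}")
```

Because the baseline is the group mean rather than a learned value function, only the relative ranking of candidates within a group drives the policy update.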
