ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration

📅 2025-04-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of balancing long-range dependency modeling and computational efficiency in ultra-high-definition (UHD) image restoration, this paper proposes a diffusion Transformer (DiT) framework operating in a 32× height-compressed latent space, enabling the first end-to-end training and inference at full 2K resolution. Our key contributions are: (1) a latent-space pyramid variational autoencoder (LP-VAE), which organizes latent variables by frequency bands to enhance diffusion training stability; and (2) an efficient DiT architecture specifically designed to leverage high model capacity within an extremely low-dimensional latent space. Experiments demonstrate that our method significantly outperforms existing diffusion-based approaches on severely degraded images, achieving substantial PSNR and SSIM improvements. Moreover, it accelerates inference by 3–5× over prior methods, marking the first real-time-capable 2K image restoration solution based on diffusion models.

Technology Category

Application Category

📝 Abstract
Recent progress in generative models has significantly improved image restoration capabilities, particularly through powerful diffusion models that offer remarkable recovery of semantic details and local fidelity. However, deploying these models at ultra-high resolutions faces a critical trade-off between quality and efficiency due to the computational demands of long-range attention mechanisms. To address this, we introduce ZipIR, a novel framework that enhances efficiency, scalability, and long-range modeling for high-res image restoration. ZipIR employs a highly compressed latent representation that compresses image 32x, effectively reducing the number of spatial tokens, and enabling the use of high-capacity models like the Diffusion Transformer (DiT). Toward this goal, we propose a Latent Pyramid VAE (LP-VAE) design that structures the latent space into sub-bands to ease diffusion training. Trained on full images up to 2K resolution, ZipIR surpasses existing diffusion-based methods, offering unmatched speed and quality in restoring high-resolution images from severely degraded inputs.
Problem

Research questions and friction points this paper is trying to address.

Balancing quality and efficiency in ultra-high-resolution image restoration
Reducing computational demands of long-range attention mechanisms
Enhancing scalability and long-range modeling for high-res restoration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Highly compressed latent representation for efficiency
Latent Pyramid VAE design for easier training
Diffusion Transformer for high-resolution restoration
🔎 Similar Papers
No similar papers found.