One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models generalize poorly beyond their training resolutions: direct high-resolution sampling is computationally expensive, while image-space super-resolution (ISR) post-processing introduces artifacts and adds latency. To address this, the authors propose the Latent Upscaler Adapter (LUA), a lightweight upsampling module that operates directly in latent space—applied once after sampling and before VAE decoding—to achieve multi-scale upsampling. LUA employs a shared Swin Transformer backbone with scale-specific pixel-shuffle heads, generalizes across the latent spaces of different VAEs, and requires no modifications to the base diffusion model or additional sampling steps. Experiments show that LUA reduces decoding-and-upscaling time by nearly 3× compared to image-space ISR: generating 1024×1024 images from 512 px adds only 0.42 s of overhead while achieving quality comparable to native high-resolution generation. This significantly improves inference efficiency and deployment flexibility.

📝 Abstract
Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
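The pipeline described in the abstract—sample a latent at the base resolution, apply LUA once as a feed-forward pass, then decode once at the target resolution—can be sketched as below. All function names and bodies are hypothetical stand-ins (the paper gives no implementation here); only the shape contract between the stages is the point.

```python
import numpy as np

def sample_latent(h, w, c=4, seed=0):
    # Stand-in for the diffusion sampler's output latent
    # (hypothetical; c=4 channels is a common SD-style choice).
    return np.random.default_rng(seed).standard_normal((c, h, w))

def lua_upscale(z, r=2):
    # Stand-in for the LUA module (really a Swin backbone with a
    # scale-specific pixel-shuffle head); nearest-neighbor repeat
    # here only to illustrate the (c, h, w) -> (c, h*r, w*r) contract.
    return z.repeat(r, axis=1).repeat(r, axis=2)

def vae_decode(z, f=8):
    # Stand-in for the VAE decoder: latent (c, h, w) -> image (3, h*f, w*f),
    # assuming the typical 8x spatial compression factor.
    c, h, w = z.shape
    return np.zeros((3, h * f, w * f))

z = sample_latent(64, 64)    # base 512 px generation at f=8
z_hi = lua_upscale(z, r=2)   # single feed-forward pass in latent space
img = vae_decode(z_hi)       # decode once at the 1024 px target
```

Because the upscaling happens before decoding, the expensive VAE decoder runs only once, at the final resolution—this is where the reported ~3× saving over pixel-space SR comes from.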
Problem

Research questions and friction points this paper is trying to address.

Overcoming slow high-resolution sampling in diffusion models
Reducing artifacts and latency from post-hoc image super-resolution
Enabling scalable high-fidelity synthesis without model retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Upscaler Adapter performs super-resolution in latent space
Shared Swin backbone with pixel-shuffle heads supports 2x and 4x scaling
Drop-in component enables high-resolution synthesis without model modifications
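The scale-specific heads mentioned above use pixel-shuffle (depth-to-space) to turn channel depth into spatial resolution. A minimal sketch of that rearrangement, matching the semantics of `torch.nn.PixelShuffle` but written in numpy for self-containment (the paper's actual head presumably adds convolutions around it):

```python
import numpy as np

def pixel_shuffle(x, r):
    # Depth-to-space: (C*r*r, H, W) -> (C, H*r, W*r).
    # Input channel c*r*r + i*r + j lands at output pixel (h*r+i, w*r+j).
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (C, i, j)
    x = x.transpose(0, 3, 1, 4, 2)     # interleave: (C, H, i, W, j)
    return x.reshape(c, h * r, w * r)  # merge (H, i) and (W, j)

# 2x head: backbone emits 4*C channels, shuffle doubles H and W.
feat = np.arange(16.0).reshape(4, 2, 2)   # C=1, r=2 toy feature map
out = pixel_shuffle(feat, r=2)            # shape (1, 4, 4)
```

A shared backbone with one such head per scale (one producing 4C channels for 2×, one 16C for 4×) is a standard way to support multiple factors without duplicating the trunk.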