PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional high-resolution text-to-image systems: their latent-to-pixel decoders are reconstruction-oriented, so they struggle to synthesize fine detail and become costly at megapixel scale. The authors propose PiD, a pixel diffusion decoder that reformulates latent decoding as a conditional pixel-space diffusion process, unifying decoding and upsampling by denoising directly in high-resolution pixel space. PiD uses a lightweight sigma-aware adapter that injects noise-corrupted latents into the backbone, which lets it decode partially denoised latents and terminate latent diffusion early. It works with both VAE and semantic latents and is distilled with DMD2 to run inference in four steps. On a consumer RTX 5090, PiD decodes latents of 512×512 images into 2048×2048 pixels in under one second with 13 GB peak memory; on a GB200 GPU it runs in as little as 210 ms, about 6× faster than cascaded diffusion-based super-resolution pipelines, with better visual fidelity.

📝 Abstract

Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes $4\times$ and even $8\times$ upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of $512 \times 512$ images into $2048 \times 2048$ pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about $6\times$ faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.
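
To make the idea of few-step conditional pixel diffusion concrete, here is a toy sketch of a four-step deterministic sampler that starts from noise at the target resolution and denoises toward an image conditioned on a low-resolution input. It is not the authors' implementation: `upsample_nearest`, `toy_denoiser`, `few_step_decode`, the noise schedule, and the use of a plain 2D grid in place of a real multi-channel latent are all assumptions made for illustration. The sigma-aware adapter and DMD2 distillation are not modeled.

```python
import random


def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of a 2D list by an integer factor."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def toy_denoiser(x, cond, sigma):
    """Hypothetical stand-in for a trained pixel diffusion backbone.

    A real model would predict the clean image from the noisy pixels `x`,
    the conditioning signal, and the noise level `sigma`. This toy version
    ignores `x` and `sigma` and returns the conditioning image, so the
    sampling loop can run end to end without any trained weights.
    """
    return cond


def few_step_decode(latent, factor=4, sigmas=(1.0, 0.75, 0.5, 0.25, 0.0), seed=0):
    """Run len(sigmas) - 1 deterministic denoising steps in pixel space.

    Assumes the noise model x = x0 + sigma * eps (an illustrative choice,
    not taken from the paper).
    """
    rng = random.Random(seed)
    cond = upsample_nearest(latent, factor)
    # Start from Gaussian noise at the target resolution.
    x = [[sigmas[0] * rng.gauss(0.0, 1.0) for _ in row] for row in cond]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = toy_denoiser(x, cond, sigma)
        # Estimate the noise, then re-noise the clean estimate to sigma_next.
        x = [
            [x0 + sigma_next * (xv - x0) / sigma for xv, x0 in zip(x_row, x0_row)]
            for x_row, x0_row in zip(x, x0_hat)
        ]
    return x


if __name__ == "__main__":
    tiny = [[0.1, 0.9], [0.4, 0.6]]  # 2x2 stand-in for a low-resolution input
    image = few_step_decode(tiny)    # four steps, 4x per side
    print(len(image), len(image[0]))
```

With the default schedule the loop takes four steps, matching the step count the abstract reports after distillation. The default factor of 4 per side mirrors the 512 to 2048 case, which is a 16× increase in pixel count.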

Problem

Research questions and friction points this paper is trying to address.

latent decoding

high-resolution image generation

pixel diffusion

decoder efficiency

megapixel synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pixel Diffusion

Latent Decoding

Conditional Generation