Accelerating Diffusion Decoders via Multi-Scale Sampling and One-Step Distillation

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion decoders suffer from high latency due to iterative sampling, hindering their use in real-time or large-scale vision tasks. This work proposes a two-stage acceleration framework that first employs multi-scale progressive sampling to reconstruct images hierarchically from low to high resolution, and then distills the diffusion process at each scale into a single-step denoising model. The approach achieves nearly lossless reconstruction quality while accelerating decoding by an order of magnitude, with a theoretical speedup of $\mathcal{O}(\log n)$. This provides a practical and scalable solution for efficient visual tokenization.

📝 Abstract
Image tokenization plays a central role in modern generative modeling by mapping visual inputs into compact representations that serve as an intermediate signal between pixels and generative models. Diffusion-based decoders have recently been adopted in image tokenization to reconstruct images from latent representations with high perceptual fidelity. In contrast to diffusion models used for downstream generation, these decoders are dedicated to faithful reconstruction rather than content generation. However, their iterative sampling process introduces significant latency, making them impractical for real-time or large-scale applications. In this work, we introduce a two-stage acceleration framework to address this inefficiency. First, we propose a multi-scale sampling strategy, where decoding begins at a coarse resolution and progressively refines the output by doubling the resolution at each stage, achieving a theoretical speedup of $\mathcal{O}(\log n)$ compared to standard full-resolution sampling. Second, we distill the diffusion decoder at each scale into a single-step denoising model, enabling fast and high-quality reconstructions in a single forward pass per scale. Together, these techniques yield an order-of-magnitude reduction in decoding time with little degradation in output quality. Our approach provides a practical pathway toward efficient yet expressive image tokenizers. We hope it serves as a foundation for future work in efficient visual tokenization and downstream generation.
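The coarse-to-fine decoding loop described in the abstract can be sketched as follows. This is a minimal illustration of the control flow only: `distilled_denoise` is a hypothetical placeholder for the learned one-step denoiser, and the base/target resolutions and nearest-neighbor upsampling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def distilled_denoise(noisy, latent, scale):
    """Placeholder for the distilled one-step denoiser at a given scale.

    In the paper this would be a learned network applied in a single
    forward pass; here it returns its input unchanged so that only the
    multi-scale schedule is demonstrated.
    """
    return noisy

def multiscale_decode(latent, base_res=32, target_res=256, seed=0):
    """Coarse-to-fine decoding: one denoising pass per scale,
    doubling the resolution at each stage (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    res = base_res
    # Start from Gaussian noise at the coarsest resolution.
    x = rng.standard_normal((res, res, 3))
    passes = 0
    while True:
        x = distilled_denoise(x, latent, scale=res)  # single forward pass
        passes += 1
        if res >= target_res:
            break
        # 2x nearest-neighbor upsample before the next refinement stage.
        x = x.repeat(2, axis=0).repeat(2, axis=1)
        res *= 2
    return x, passes

img, n_passes = multiscale_decode(latent=None)
# The number of denoiser passes grows as log2(target_res / base_res) + 1,
# rather than with the step count of a full-resolution diffusion sampler.
```

With `base_res=32` and `target_res=256`, the loop runs one pass at each of the resolutions 32, 64, 128, and 256, i.e. four forward passes total, which is the logarithmic schedule the abstract's $\mathcal{O}(\log n)$ speedup refers to.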
Problem

Research questions and friction points this paper is trying to address.

diffusion decoders
image tokenization
iterative sampling
latency
real-time applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-scale sampling
one-step distillation
diffusion decoder
image tokenization
efficient reconstruction
Chuhan Wang
University of California San Diego
Hao Chen
Carnegie Mellon University
Deep Learning
Representation Learning