SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
KL-regularized variational autoencoders (KL-VAEs) used for image tokenization face two key bottlenecks: reliance on adversarial training and computationally expensive multi-step diffusion decoding. Method: We propose the Single-Step Diffusion Decoder (SSDD), a Transformer-based pixel diffusion decoder that eliminates GAN losses and instead relies only on KL regularization and knowledge distillation, enabling genuine one-step reconstruction. Contribution/Results: SSDD is the first diffusion decoder optimized for efficient single-step generation without adversarial training. Experiments show substantial improvements: reconstruction FID drops from 0.87 to 0.50 with 1.4× higher throughput; DiT generation quality is preserved with 3.8× faster sampling; and SSDD serves as a drop-in replacement for existing KL-VAE tokenizers, retaining high-fidelity reconstruction while drastically reducing inference overhead.

📝 Abstract
Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses, as well as a higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from transformer components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput and preserves the generation quality of DiTs with $3.8\times$ faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE, and for building higher-quality and faster generative models.
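The distillation idea in the abstract (an efficient single-step student decoder trained to match a multi-step diffusion teacher) can be illustrated with a toy sketch. Everything below is a hypothetical stand-in, not the paper's architecture: linear maps play the role of the teacher and student decoders, and the objective is a simple mean-squared-error match between the student's one-step output and the teacher's iterative reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy latent/pixel dimension (hypothetical, for illustration only)

# Hypothetical "teacher": decodes a latent z by iterative refinement,
# mimicking multi-step diffusion sampling. A fixed linear map stands in
# for the actual denoising network.
W_teacher = 0.1 * rng.normal(size=(D, D)) + np.eye(D)

def teacher_decode(z, n_steps=4):
    x = rng.normal(size=D)  # start from noise
    for _ in range(n_steps):
        # one toy refinement step conditioned on the latent z
        x = 0.5 * x + 0.5 * (W_teacher @ z)
    return x

# Hypothetical "student": a single forward pass from the latent.
def student_decode(z, W_student):
    return W_student @ z

# Distillation objective: the student's one-step output should match the
# teacher's multi-step reconstruction (MSE, as a stand-in for the paper's
# actual training losses).
def distill_loss(z, W_student):
    return np.mean((student_decode(z, W_student) - teacher_decode(z)) ** 2)
```

Minimizing this loss over many latents would drive the one-step student toward the teacher's multi-step behavior, which is the efficiency win the abstract describes: iterative sampling is paid only at training time, not at inference.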
Problem

Research questions and friction points this paper is trying to address.

Improving image tokenizer efficiency by replacing KL-VAE with diffusion decoders
Eliminating adversarial training while maintaining high reconstruction quality
Achieving faster image sampling without compromising generation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-step diffusion decoder for efficient image tokenization
Transformer components and GAN-free training architecture
Distillation enables single-step reconstruction without adversarial losses
🔎 Similar Papers
No similar papers found.