🤖 AI Summary
Diffusion language models struggle to balance generation quality against inference efficiency because they model dependencies between tokens poorly. This work proposes DiLaDiff, a latent-guided variant of masked diffusion language models: it first builds a continuous, semantically meaningful latent space with an auto-encoder fine-tuned from an existing masked diffusion language model, then trains a latent diffusion model as the prior over that space, and finally distills the prior into a few-step generator via consistency distillation. Even without distillation, the approach outperforms the masked diffusion baseline while significantly accelerating inference; with distillation, latent generation takes negligible time compared to discrete decoding.
📝 Abstract
Diffusion language models intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampling quality and throughput. To solve this issue, we propose DiLaDiff, a variant of masked diffusion language models with three components: (1) a continuous latent space with semantic capabilities, learned by an auto-encoder fine-tuned from an existing masked diffusion language model; (2) a latent diffusion model learning the prior over the encoder distribution; (3) a consistency model distilling the learned prior into a few-step latent generative model. We show that, even without distillation, our latent-guided diffusion model outperforms the masked diffusion baseline while significantly accelerating inference. Consistency distillation further lowers the computational overhead of continuous diffusion, such that the latent is generated in negligible time compared to discrete decoding.
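The correlation problem described in the abstract, and the way a latent variable can ease it, can be illustrated with a toy example. The sketch below is not the DiLaDiff implementation: the two-token vocabulary, the discrete latent `z`, and the function names are all hypothetical stand-ins chosen for illustration. It compares decoding two positions independently from their marginals with decoding them independently after first sampling a shared latent.

```python
import random

# Toy data distribution over two-token sequences: only these two pairs occur.
PAIRS = [("New", "York"), ("Los", "Angeles")]


def sample_factorized(rng):
    """Decode both positions in one step from independent per-position marginals."""
    first = rng.choice([p[0] for p in PAIRS])
    second = rng.choice([p[1] for p in PAIRS])
    return first, second


def sample_latent_guided(rng):
    """Draw a latent z first, then decode each position independently given z."""
    # Hypothetical stand-in for a sample from a learned latent prior.
    z = rng.randrange(len(PAIRS))
    # Given z, each per-position conditional is deterministic in this toy,
    # so independent decoding still yields a consistent pair.
    first = PAIRS[z][0]
    second = PAIRS[z][1]
    return first, second


def invalid_rate(sampler, n=10_000, seed=0):
    """Fraction of samples that fall outside the data distribution."""
    rng = random.Random(seed)
    valid = set(PAIRS)
    return sum(sampler(rng) not in valid for _ in range(n)) / n


if __name__ == "__main__":
    print("factorized    :", invalid_rate(sample_factorized))
    print("latent-guided :", invalid_rate(sample_latent_guided))
```

In this toy setting, the factorized sampler produces mismatched pairs such as ("New", "Angeles") about half the time, while the latent-guided sampler never does, because all cross-position correlation is carried by `z`. In the paper's setting the latent is continuous and drawn from a diffusion or consistency model rather than a uniform discrete choice.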