🤖 AI Summary
Discrete diffusion models suffer from weak joint structural modeling and degraded low-step generation quality because masked denoisers factorize the reverse transition independently across positions. To address this, we propose the Latent-Discrete Diffusion Model (LDDM), a dual-channel framework that jointly models the discrete token space and a continuous latent embedding space. Crucially, LDDM introduces an explicit latent-variable channel that propagates cross-position dependencies, thereby strengthening joint structural modeling. We instantiate LDDM in two variants: fully joint (FUJI-LDDM) and sequential (SEQ-LDDM). Both are trained end-to-end with an ELBO-inspired objective that jointly optimizes discrete masked diffusion and continuous latent diffusion, learning informative representations tailored to the diffusion process. Experiments on unconditional text generation demonstrate that LDDM significantly outperforms existing discrete diffusion models, especially under low sampling budgets (≤16 steps), achieving substantial improvements in both sample quality and structural coherence.
📝 Abstract
We study discrete diffusion for language and other categorical data and focus on a common limitation of masked denoisers: reverse transitions typically factorize across positions, which can weaken joint structure and degrade quality in few-step generation. We propose *Latent Discrete Diffusion Models* (LDDMs), which couple a masked discrete diffusion over tokens with a continuous diffusion over latent embeddings. The latent channel provides a softer signal and carries cross-token dependencies that help resolve ambiguities. We present two instantiations: (i) FUJI-LDDMs, which perform fully joint denoising of tokens and latents, and (ii) SEQ-LDDMs, which sequentially resolve the latent chain and then the discrete chain conditionally on it. For both variants we derive ELBO-style objectives and discuss design choices for learning latents that are informative yet amenable to diffusion modeling. In experiments, LDDMs yield improvements on unconditional generation metrics over state-of-the-art masked discrete diffusion baselines, and are effective at lower sampling budgets, where unmasking many tokens per step is desirable.
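To make the dual-channel idea concrete, here is a minimal toy sketch of a fully joint (FUJI-style) reverse step: a continuous latent per position is denoised alongside the discrete chain, and the latent conditions which tokens get unmasked. Everything here is an illustrative assumption, not the paper's implementation: `toy_denoiser` stands in for the learned network, the latent update is a simple Euler-style step, and the unmasking schedule (half of the remaining masks per step) is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ, DIM, MASK = 8, 6, 4, 8  # MASK is an extra token id outside the vocab

# Fixed random projection standing in for learned readout weights (assumption).
W_READOUT = rng.standard_normal((DIM, VOCAB))

def toy_denoiser(tokens, latents):
    """Hypothetical stand-in for the learned model: per-position token logits
    conditioned on the continuous latents, plus a latent noise estimate.
    In LDDM this would be a network trained with the ELBO-style objective."""
    logits = latents @ W_READOUT          # (SEQ, VOCAB), cross-token info via latents
    eps_hat = 0.1 * latents               # toy noise prediction for the latent channel
    return logits, eps_hat

def joint_reverse_step(tokens, latents, dt=0.1):
    """One joint reverse step: both channels update together (FUJI-style)."""
    logits, eps_hat = toy_denoiser(tokens, latents)
    # Continuous channel: Euler-style denoising update on the latents.
    latents = latents - dt * eps_hat
    # Discrete channel: unmask a fraction of still-masked positions, filling
    # them from the latent-conditioned logits instead of independent guesses.
    masked = np.where(tokens == MASK)[0]
    if masked.size:
        pick = rng.choice(masked, size=max(1, masked.size // 2), replace=False)
        tokens[pick] = logits[pick].argmax(axis=-1)
    return tokens, latents

tokens = np.full(SEQ, MASK)               # start fully masked
latents = rng.standard_normal((SEQ, DIM)) # start from latent prior noise
for _ in range(4):                        # a low sampling budget
    tokens, latents = joint_reverse_step(tokens, latents)
```

A SEQ-style variant would instead run the latent updates to completion first and only then unmask tokens conditioned on the resolved latents; the point of the sketch is that each unmasking decision reads a shared latent state rather than being made position-wise independently.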