🤖 AI Summary
This work addresses the gap between continuous diffusion models and masked discrete diffusion models on discrete sequence generation. The authors attribute the gap not to continuity itself but to a representation in which denoising depends on timestep-indexed noise regimes. They propose Discrete Stochastic Localization (DSL), a continuous-state framework that uses unit-sphere token embeddings so that the Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. Because of this invariance, a single trained network supports an entire family of per-token SNR paths, including endpoint masked-diffusion paths, random-order autoregressive sampling, and a hybrid continuous-then-discrete sampler, without distillation or retraining. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (measured by MAUVE) on OpenWebText across step budgets from 128 to 1024, and the hybrid sampler runs with as few as 48 total steps.
📝 Abstract
Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from $T{=}128$ to $T{=}1024$, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as $T{=}48$ total steps, without distillation or retraining.
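The SNR-invariance claim can be illustrated with a toy single-token calculation. The sketch below is not the authors' implementation: it assumes a standard stochastic-localization observation $y = t\,x + \sqrt{t}\,z$ with $z \sim \mathcal{N}(0, I)$, a uniform prior over tokens, and randomly drawn unit-norm embeddings (all function names are hypothetical). Under those assumptions, the Gaussian log-likelihood of token $k$ is $\langle y, e_k\rangle - t\,\|e_k\|^2/2$ plus a term that does not depend on $k$, so with $\|e_k\| = 1$ the posterior over tokens reduces to a softmax of inner products and needs no explicit knowledge of $t$.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and project it onto the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def localization_observation(x, t, rng):
    """Assumed channel: y = t * x + sqrt(t) * z, with z ~ N(0, I)."""
    return [t * c + math.sqrt(t) * rng.gauss(0.0, 1.0) for c in x]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def posterior_with_t(y, embeddings, t):
    """Token posterior from the full Gaussian likelihood, using t explicitly."""
    logits = [
        -sum((yc - t * ec) ** 2 for yc, ec in zip(y, e)) / (2.0 * t)
        for e in embeddings
    ]
    return softmax(logits)


def posterior_without_t(y, embeddings):
    """Token posterior from inner products only; t is never supplied."""
    logits = [sum(yc * ec for yc, ec in zip(y, e)) for e in embeddings]
    return softmax(logits)


if __name__ == "__main__":
    rng = random.Random(0)
    embeddings = [random_unit_vector(8, rng) for _ in range(5)]
    for t in (0.1, 1.0, 10.0):
        y = localization_observation(embeddings[2], t, rng)
        with_t = posterior_with_t(y, embeddings, t)
        without_t = posterior_without_t(y, embeddings)
        gap = max(abs(a - b) for a, b in zip(with_t, without_t))
        print(f"t={t}: max posterior gap = {gap:.2e}")
```

The two posteriors agree up to floating-point error at every noise level, which is the sense in which a denoiser on the sphere can be shared across SNR paths in this toy setting. If the embeddings had unequal norms, the $-t\,\|e_k\|^2/2$ term would differ across tokens and the denoiser would need $t$ as an input.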