🤖 AI Summary
Discrete tasks such as language modeling remain challenging for existing diffusion models due to the fundamental mismatch between discrete token spaces and continuous diffusion paradigms, leading to information loss and a lack of rigorous theoretical grounding. To bridge this gap, we propose a continuous diffusion language model grounded in statistical manifold theory: token embeddings are mapped onto a Riemannian manifold, where both forward diffusion and reverse generative processes operate in the continuous geometric space. We establish, for the first time, a formal theoretical correspondence between discrete diffusion and probability flows on manifolds. Our method introduces geometry-aware diffusion dynamics and a radial-symmetry-driven, sampling-free training framework that circumvents the difficulties of high-dimensional manifold modeling. Experiments demonstrate that our model consistently outperforms discrete diffusion baselines on standard language modeling benchmarks, approaches autoregressive performance, and exhibits strong generalization across multimodal tasks.
📝 Abstract
Diffusion models have emerged as a promising alternative to autoregressive models for modeling discrete categorical data. Yet diffusion models that operate directly on the discrete data space do not fully exploit the power of iterative refinement, as signal is lost during transitions between discrete states. Existing continuous diffusion models for discrete data underperform discrete approaches, and the unclear link between the two restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between discrete diffusion and continuous flow on the statistical manifold, and building on this analogy, we introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry, together with a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Code is available at https://github.com/harryjo97/RDLM.
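The statistical-manifold view has a standard concrete instance: under the square-root map, the simplex of categorical distributions equipped with the Fisher-Rao metric becomes an orthant of the unit hypersphere, so a continuous forward process can move a token distribution along sphere geodesics instead of jumping between discrete states. The minimal sketch below illustrates only this geometry; the function names, the toy vocabulary size, and the choice of the uniform distribution as the noise endpoint are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def to_sphere(p):
    """Square-root map: a categorical distribution p on the simplex
    becomes a unit vector; the Fisher-Rao metric pulls back to the
    round metric on the sphere."""
    return np.sqrt(p)

def slerp(x, y, t):
    """Geodesic (great-circle) interpolation between unit vectors x and y."""
    theta = np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))
    if theta < 1e-8:
        return x
    return (np.sin((1 - t) * theta) * x + np.sin(t * theta) * y) / np.sin(theta)

# Toy example (assumed setup): diffuse a one-hot token distribution
# toward the uniform distribution along the sphere geodesic.
V = 5                          # illustrative vocabulary size
one_hot = np.eye(V)[2]         # token index 2
uniform = np.full(V, 1.0 / V)

x0 = to_sphere(one_hot)
x1 = to_sphere(uniform)

xt = slerp(x0, x1, t=0.5)      # point on the manifold at noise level t=0.5
pt = xt ** 2                   # map back to a valid distribution
print(np.round(pt, 3), pt.sum())
```

Note that `pt` is always a proper probability vector (squared coordinates of a unit vector sum to one), which is the practical appeal of this parameterization: the continuous process never leaves the space of categorical distributions.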