🤖 AI Summary
Continuous diffusion language models have appeared less scalable than their discrete counterparts. This work challenges that view by re-engineering the Plaid architecture to match modern discrete diffusion language models and by using likelihood-based training to study scalability in a unified setting. It establishes the first scaling law for continuous diffusion language models that rivals discrete ones. On the theory side, it shows that choosing the noise schedule to minimize the variance of the ELBO naturally yields a cross-entropy (information loss) that is linear in time, which distributes denoising difficulty evenly across timesteps without case-specific time reparameterization. It also finds that optimizing the embeddings via likelihood produces structured geometry in the embedding space and accounts for the most significant likelihood gain. The resulting RePlaid model achieves a state-of-the-art perplexity bound of 22.1 on OpenWebText among continuous diffusion models, shows a compute gap of only 20× relative to autoregressive models, outperforms Duo with fewer parameters, and outperforms MDLM in the over-trained regime.
📝 Abstract
While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief, we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only $20\times$ compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We also benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of $22.1$ among continuous DLMs, along with superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights into the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields a cross-entropy (information loss) that is linear in time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.
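To make the "PPL bound" figure concrete, the sketch below uses the standard relation between perplexity and per-token negative log-likelihood: perplexity is the exponential of the average NLL in nats. For a likelihood-based diffusion model, the negative ELBO upper-bounds the NLL, so exponentiating it gives an upper bound on perplexity. This is not the paper's evaluation code; the function names are hypothetical and chosen for illustration only.

```python
import math


def perplexity_from_nll(total_nll_nats: float, num_tokens: int) -> float:
    """Perplexity implied by a total negative log-likelihood in nats.

    If total_nll_nats is a negative ELBO (an upper bound on the true NLL),
    the returned value is an upper bound on the true perplexity.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(total_nll_nats / num_tokens)


def nats_per_token_from_perplexity(ppl: float) -> float:
    """Average per-token NLL (nats) corresponding to a given perplexity."""
    if ppl <= 0:
        raise ValueError("perplexity must be positive")
    return math.log(ppl)


if __name__ == "__main__":
    # 22.1 is the perplexity bound quoted in the abstract.
    nats = nats_per_token_from_perplexity(22.1)
    bits = nats / math.log(2)  # convert nats to bits
    print(f"{nats:.3f} nats/token, {bits:.3f} bits/token")
```

Read this way, a lower perplexity bound corresponds directly to a tighter (lower) per-token negative ELBO.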