🤖 AI Summary
Masked diffusion language models incur factorization error because they update multiple tokens in parallel, which limits generation quality when only a few sampling steps are used. This work proposes the Infinite Mask Diffusion Model (IMDM), which replaces the deterministic single-state mask with a stochastic infinite-state mask. IMDM retains parallel decoding and bidirectional context while mitigating the theoretical lower bound on factorization error that standard masked diffusion models cannot reduce. It is compatible with pretrained weights and, when combined with appropriate distillation, surpasses existing few-step distillation methods at small step counts on LM1B and OpenWebText. Experiments on a simple synthetic task further show that a standard MDM fails at few-step generation where IMDM finds an efficient solution.
📝 Abstract
Masked Diffusion Models (MDMs) have emerged as a promising alternative to autoregressive models in language modeling, offering parallel decoding and bidirectional context processing within a simple yet effective framework. Specifically, their explicit distinction between masked tokens and data underlies both the simplicity of the framework and its effective conditional generation. However, MDMs typically require many sampling iterations because updating tokens simultaneously introduces factorization error. We observe that there is a theoretical lower bound on the factorization error, which standard MDMs cannot reduce because they use a deterministic single-state mask. In this paper, we propose the Infinite Mask Diffusion Model (IMDM), which introduces a stochastic infinite-state mask to mitigate this bound while directly inheriting the benefits of MDMs, including compatibility with pre-trained weights. We empirically demonstrate that an MDM fails at few-step generation even on a simple synthetic task because of the factorization error bound, whereas IMDM finds an efficient solution for the same task. Finally, when equipped with appropriate distillation methods, IMDM surpasses existing few-step distillation methods at small step counts on LM1B and OpenWebText. Code is available at https://Ugness.github.io/official_imdm.
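To make the factorization error concrete, the following toy sketch may help. It is not the paper's synthetic task or its implementation: the two-token target, the helper names, and the hand-written "stochastic mask" denoiser are all assumptions chosen for illustration. It compares a one-step sampler that must factorize over positions (the situation with a single deterministic mask) against a hypothetical sampler whose per-position outputs are still factorized, but conditioned on a random mask state.

```python
import math
from itertools import product

# Toy target (an assumption for illustration): two perfectly correlated
# tokens, uniform over the sequences "AA" and "BB".
VOCAB = ("A", "B")
target = {("A", "A"): 0.5, ("B", "B"): 0.5}


def marginals(joint):
    """Per-position marginal distributions of a two-token joint."""
    m = [{v: 0.0 for v in VOCAB} for _ in range(2)]
    for seq, p in joint.items():
        for i, tok in enumerate(seq):
            m[i][tok] += p
    return m


def kl(p, q):
    """KL(p || q) in nats; infinite if q misses part of p's support."""
    total = 0.0
    for seq, ps in p.items():
        if ps == 0.0:
            continue
        qs = q.get(seq, 0.0)
        if qs == 0.0:
            return math.inf
        total += ps * math.log(ps / qs)
    return total


def one_step_single_mask(joint):
    """With every position showing the same deterministic mask, a one-step
    sampler draws each token independently. The product of marginals is the
    KL-optimal distribution of that factorized form."""
    m = marginals(joint)
    return {(a, b): m[0][a] * m[1][b] for a, b in product(VOCAB, repeat=2)}


def one_step_stochastic_mask():
    """Hypothetical denoiser (not the paper's model): it reads a random mask
    state z and emits the same token at both positions. Given z, the output
    is factorized; averaged over z, it is a mixture that can be correlated."""
    out = {}
    for z, pz in ((0, 0.5), (1, 0.5)):
        tok = VOCAB[z]
        out[(tok, tok)] = out.get((tok, tok), 0.0) + pz
    return out


single = one_step_single_mask(target)
stochastic = one_step_stochastic_mask()
kl_single = kl(target, single)
kl_stochastic = kl(target, stochastic)

print(f"single-state mask, one step: KL = {kl_single:.4f} nats")
print(f"stochastic mask, one step:   KL = {kl_stochastic:.4f} nats")
```

In this toy setting the factorized sampler spreads probability evenly over AA, AB, BA, and BB, so its KL divergence from the target is log 2 (about 0.693 nats) no matter how well the marginals are learned. The mask-conditioned sampler reaches zero because the randomness in the mask state selects which correlated sequence to emit. How IMDM actually parameterizes and trains its infinite-state mask is not described in the abstract and is not modeled here.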