Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior comparisons of autoregressive (AR) models and masked diffusion models (MDMs) conflate paradigm and architecture: AR models typically adopt decoder-only Transformers, whereas MDMs commonly use encoder-only ones, obscuring the true source of performance differences. This paper removes that confound by implementing MDM in a decoder-only Transformer, enabling arbitrary-order token generation via masked diffusion modeling. An autoregressive-style training objective combined with temperature annealing enables an equitable comparison with AR baselines. Experiments show that the decoder-only design eliminates the architectural confound: it matches the perplexity of encoder-only MDMs while achieving roughly a 25× inference speedup. These results establish decoder-only Transformers as effective and flexible backbones for scalable diffusion-based sequence generation.

📝 Abstract
Large language models (LLMs) predominantly use autoregressive (AR) approaches, but masked diffusion models (MDMs) are emerging as viable alternatives. A key challenge in comparing AR and MDM paradigms is their typical architectural difference: AR models are often decoder-only, while MDMs have largely been encoder-only. This practice of changing both the modeling paradigm and architecture simultaneously makes direct comparisons unfair, as it is hard to distinguish whether observed differences stem from the paradigm itself or the architectural shift. This research evaluates MDMs within a decoder-only framework to: (1) equitably compare MDM (as Any-Order AR, or AO-AR) and standard AR paradigms. Our investigation suggests that the standard AO-AR objective, which averages over all token permutations, may benefit from refinement, as many permutations appear less informative compared to the language's inherent left-to-right structure. (2) Investigate architectural influences (decoder-only vs. encoder-only) within MDMs. We demonstrate that while encoder-only MDMs model a simpler conditional probability space, decoder-only MDMs can achieve dramatic generation speedups ($\sim25\times$) and comparable perplexity with temperature annealing despite modeling a vastly larger space, highlighting key trade-offs. This work thus decouples core paradigm differences from architectural influences, offering insights for future model design. Code is available at https://github.com/scxue/AO-GPT-MDM.
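The AO-AR objective described in the abstract averages the AR loss over token permutations; in practice this expectation is typically estimated by sampling one permutation per training example. The sketch below is an illustrative reconstruction of how such a training example can be built, not the authors' code; the function name and data layout are assumptions.

```python
import random

def ao_ar_training_pairs(tokens, rng=None):
    """Build (context, target_position, target_token) triples for one
    sampled generation order.

    Any-order AR trains the model to predict each token given the set
    of positions revealed earlier in a random permutation, rather than
    only the left-to-right prefix. Sampling one permutation per example
    is the usual Monte Carlo estimate of the average over all orders.
    """
    rng = rng or random.Random(0)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # one permutation sampled per training example

    pairs = []
    revealed = {}  # position -> token visible to the model so far
    for pos in order:
        # context: positions/tokens revealed so far; target: next
        # position in the sampled order and its ground-truth token
        pairs.append((dict(revealed), pos, tokens[pos]))
        revealed[pos] = tokens[pos]
    return pairs

pairs = ao_ar_training_pairs(["the", "cat", "sat"])
```

The left-to-right order is just one of the $n!$ permutations this procedure can sample, which is why the abstract notes that many sampled orders are less informative than the language's natural ordering.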
Problem

Research questions and friction points this paper is trying to address.

Compare AR and MDM paradigms fairly in decoder-only frameworks
Refine AO-AR objective by addressing uninformative token permutations
Analyze decoder-only vs encoder-only architectural trade-offs in MDMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoder-only framework for masked diffusion models
Any-Order AR objective with token permutations
Temperature annealing for faster generation speed
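Temperature annealing here means lowering the sampling temperature as iterative decoding proceeds, so early steps explore diverse token choices while later steps approach greedy decoding. A minimal sketch of one such schedule follows; the geometric shape and the endpoint values are illustrative assumptions, not the paper's settings.

```python
def annealed_temperature(step, total_steps, t_start=1.0, t_end=0.1):
    """Geometrically decay sampling temperature across decoding steps.

    Returns t_start at step 0 and t_end at the final step; sampled
    logits would be divided by this value before softmax. The schedule
    shape and endpoints are illustrative, not taken from the paper.
    """
    frac = step / max(total_steps - 1, 1)  # progress in [0, 1]
    return t_start * (t_end / t_start) ** frac

schedule = [annealed_temperature(s, 10) for s in range(10)]
```

A decaying schedule like this is one way to trade off sample diversity against the perplexity of the final sequence when many tokens are committed per decoding step.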