🤖 AI Summary
This work addresses the limited modeling efficiency of Masked Diffusion Models (MDMs) for discrete sequence generation, which stems from their effectively random, non-optimized decoding order. We propose a training framework that learns the decoding order end-to-end. Our core method establishes an explicit mapping between the decoding order and a multivariate noise schedule, thereby relaxing the conventional invariance assumption of MDM objectives with respect to noise scheduling. The optimization objective is reformulated as a decoding-order-dependent weighted autoregressive loss. We provide theoretical analysis proving that MDMs are equivalent to autoregressive models with learnable decoding orders. Empirical results demonstrate that our framework substantially improves both training efficiency and generation quality, outperforming standard MDMs and representative autoregressive baselines across multiple sequence modeling tasks.
📝 Abstract
Masked Diffusion Models (MDMs) have emerged as one of the most promising paradigms for generative modeling over discrete domains. It is known that MDMs effectively train to decode tokens in a random order, and that this ordering has significant performance implications in practice. This observation raises a fundamental question: can we design a training framework that optimizes for a favorable decoding order? We answer this in the affirmative, showing that the continuous-time variational objective of MDMs, when equipped with multivariate noise schedules, can identify and optimize for a decoding order during training. We establish a direct correspondence between the decoding order and the multivariate noise schedule and show that this setting breaks the invariance of the MDM objective to the noise schedule. Furthermore, we prove that the MDM objective decomposes precisely into a weighted sum of auto-regressive losses over these orders, which establishes MDMs as auto-regressive models with learnable orders.
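As a rough illustration of the claimed decomposition (notation is ours, not taken from the paper): writing $S_L$ for the set of permutations of the $L$ token positions, the MDM objective can be viewed as a weighted mixture of order-conditional auto-regressive losses,

$$
\mathcal{L}_{\text{MDM}} \;=\; \sum_{\sigma \in S_L} w(\sigma)\, \mathbb{E}_{x}\Big[ -\sum_{t=1}^{L} \log p_\theta\big(x_{\sigma(t)} \mid x_{\sigma(<t)}\big) \Big],
$$

where a univariate (shared) noise schedule yields uniform weights $w(\sigma) = 1/L!$, recovering the familiar any-order training of MDMs, while a multivariate (per-token) schedule skews $w(\sigma)$ toward particular orders, making the decoding order a learnable quantity.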