🤖 AI Summary
This work addresses the limited modeling efficiency of Masked Diffusion Models (MDMs) for discrete sequence generation, which stems from their effectively random, non-optimized decoding order. We propose a training framework that learns the decoding order end-to-end. Our core method establishes an explicit mapping between the decoding order and a multivariate noise schedule, thereby relaxing the conventional invariance assumption of MDM objectives with respect to noise scheduling. The optimization objective is reformulated as a decoding-order-dependent weighted autoregressive loss. We provide theoretical analysis proving that MDMs are equivalent to autoregressive models with learnable decoding orders. Empirical results demonstrate that our framework substantially improves both training efficiency and generation quality, outperforming standard MDMs and representative autoregressive baselines across multiple sequence modeling tasks.
📝 Abstract
Masked Diffusion Models (MDMs) have emerged as one of the most promising paradigms for generative modeling over discrete domains. It is known that MDMs effectively train to decode tokens in a random order, and that this ordering has significant performance implications in practice. This observation raises a fundamental question: can we design a training framework that optimizes for a favorable decoding order? We answer this in the affirmative, showing that the continuous-time variational objective of MDMs, when equipped with multivariate noise schedules, can identify and optimize for a decoding order during training. We establish a direct correspondence between the decoding order and the multivariate noise schedule and show that this setting breaks the invariance of the MDM objective to the noise schedule. Furthermore, we prove that the MDM objective decomposes precisely into a weighted sum of auto-regressive losses over these orders, which establishes MDMs as auto-regressive models with learnable orders.
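As a rough illustration of the claimed decomposition (notation is ours, not taken from the paper): writing $S_L$ for the set of permutations of the $L$ token positions, the MDM objective can be viewed as a weighted mixture of order-conditional auto-regressive losses,

$$
\mathcal{L}_{\text{MDM}} \;=\; \sum_{\sigma \in S_L} w(\sigma)\, \mathbb{E}_{x}\Big[ -\sum_{t=1}^{L} \log p_\theta\big(x_{\sigma(t)} \mid x_{\sigma(<t)}\big) \Big],
$$

where a univariate (shared) noise schedule yields uniform weights $w(\sigma) = 1/L!$, recovering the familiar any-order training of MDMs, while a multivariate (per-token) schedule skews $w(\sigma)$ toward particular orders, making the decoding order a learnable quantity.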