Optimal Inference Schedules for Masked Diffusion Models

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper investigates the theoretical limits and efficient implementation of high-quality parallel sampling in masked diffusion models (MDMs), addressing the fundamental question: how much parallelism can be achieved without compromising generation quality? Method: the authors derive, for the first time under arbitrary data distributions, an exact characterization of the expected discrepancy between the true and sampled distributions. Leveraging information-theoretic principles, they establish upper and lower bounds based on total correlation and dual total correlation. They further uncover a deep connection between sampling steps and univariate function approximation, proving that $O(\log n)$ steps suffice for lossless sampling under natural regularity conditions. The proposed scheduling strategies explicitly exploit distributional structure by integrating information theory with approximation theory. Contribution/Results: the paper provides the first provably efficient theoretical framework for MDM inference, enabling significantly higher sampling throughput while preserving sample fidelity, without empirical heuristics or quality degradation.

📝 Abstract
A major bottleneck of standard auto-regressive large language models is that their inference process is inherently sequential, resulting in very long and costly inference times. To circumvent this, practitioners proposed a class of language models called diffusion language models, of which the masked diffusion model (MDM) is the most successful. The MDM is able to sample tokens out-of-order and, ostensibly, many tokens at once and in parallel. However, there is very limited rigorous understanding of how much parallel sampling these models can perform without noticeable degradation in their sampling performance. Prior work of Li and Cai obtained some preliminary bounds, but these are not tight for many natural classes of distributions. In this work, we give a new, exact characterization of the expected divergence between the true distribution and the sampled distribution, for any distribution and any unmasking schedule for the sampler, showing an elegant connection to the theory of univariate function approximation. By leveraging this connection, we then attain a number of novel lower and upper bounds for this problem. While the connection to function approximation in principle gives the optimal unmasking schedule for any distribution, we show that it is in general impossible to compete with it without strong a priori knowledge of the distribution, even in seemingly benign settings. However, we also demonstrate new upper bounds and new sampling schedules in terms of well-studied information-theoretic properties of the base distribution, namely, its total correlation and dual total correlation, which show that in some natural settings, one can sample in $O(\log n)$ steps without any visible loss in performance, where $n$ is the total sequence length.
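The bounds above are stated in terms of two standard multi-information measures of the base distribution: total correlation $TC(X) = \sum_i H(X_i) - H(X)$ and dual total correlation $DTC(X) = H(X) - \sum_i H(X_i \mid X_{-i})$. As a minimal sketch (not from the paper), these quantities can be computed exactly for a small explicit joint distribution; the function names and the dict-based joint representation here are illustrative choices:

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a probability mapping {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(joint, idxs):
    """Marginal distribution of the coordinates listed in idxs."""
    m = {}
    for x, q in joint.items():
        key = tuple(x[i] for i in idxs)
        m[key] = m.get(key, 0.0) + q
    return m

def total_correlation(joint, n):
    """TC(X) = sum_i H(X_i) - H(X_1, ..., X_n)."""
    return sum(entropy(marginal(joint, [i])) for i in range(n)) - entropy(joint)

def dual_total_correlation(joint, n):
    """DTC(X) = H(X) - sum_i H(X_i | X_{-i}),
    using H(X_i | X_{-i}) = H(X) - H(X_{-i})."""
    H = entropy(joint)
    cond = sum(H - entropy(marginal(joint, [j for j in range(n) if j != i]))
               for i in range(n))
    return H - cond

# Toy example: two perfectly correlated fair bits.
joint = {(0, 0): 0.5, (1, 1): 0.5}
print(total_correlation(joint, 2))       # 1.0 bit
print(dual_total_correlation(joint, 2))  # 1.0 bit
```

For independent coordinates both measures are zero, which matches the intuition that independent tokens can be unmasked fully in parallel at no cost.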
Problem

Research questions and friction points this paper is trying to address.

Characterizing divergence between true and sampled distributions in masked diffusion models
Determining optimal unmasking schedules for parallel token sampling
Establishing bounds on sampling steps using information-theoretic properties
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exact characterization of divergence for unmasking schedules
Optimal unmasking schedules via function approximation theory
Sampling in logarithmic steps using information-theoretic properties
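To make the step-count claim concrete: any schedule that reveals a geometrically growing number of tokens per step covers all $n$ positions in $O(\log n)$ steps. The sketch below is a hypothetical doubling schedule illustrating the step count only; the paper's schedules are chosen based on the distribution's information-theoretic structure, not this fixed rule:

```python
def doubling_schedule(n):
    """Reveal 1, 2, 4, 8, ... tokens per step until all n positions are
    unmasked, giving ceil(log2(n)) + 1 steps in the worst case.
    Illustrative only; not the paper's distribution-aware schedule."""
    schedule, revealed, step = [], 0, 1
    while revealed < n:
        k = min(step, n - revealed)  # don't overshoot the sequence length
        schedule.append(k)
        revealed += k
        step *= 2
    return schedule

print(doubling_schedule(10))          # [1, 2, 4, 3]
print(len(doubling_schedule(1024)))   # 11
```

The contrast with auto-regressive decoding is exactly the point: sequential sampling needs $n$ steps, while a schedule like this needs roughly $\log_2 n$, provided the divergence bounds tolerate the larger parallel blocks.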