Error Bounds and Optimal Schedules for Masked Diffusions with Factorized Approximations

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Masked diffusion models (MDMs) for discrete data generation face a fundamental trade-off between computational efficiency and sampling accuracy, caused by the conditional independence approximation inherent in standard masking schemes. Method: We propose a non-constant masking schedule that dynamically adjusts the number of tokens unmasked at each sampling step, decoupling computational cost from sequence length. We derive, for the first time, an upper bound on the relative entropy error that is independent of sequence length, and construct an optimal schedule based on the data's information profile. Crucially, we define the sampling algorithm directly, bypassing intricate time-reversal derivations, which enables concise and rigorous theoretical analysis. Contribution/Results: Our analysis reveals the core mechanism underlying MDMs' efficiency, establishing that optimal scheduling must align with the structural characteristics of the information profile. This yields an interpretable, computationally tractable optimization framework for low-bias, fast discrete sequence generation.

📝 Abstract
Recently proposed generative models for discrete data, such as Masked Diffusion Models (MDMs), exploit conditional independence approximations to reduce the computational cost of popular Auto-Regressive Models (ARMs), at the price of some bias in the sampling distribution. We study the resulting computation-vs-accuracy trade-off, providing general error bounds (in relative entropy) that depend only on the average number of tokens generated per iteration and are independent of the data dimensionality (i.e. sequence length), thus supporting the empirical success of MDMs. We then investigate the gain obtained by using non-constant schedule sizes (i.e. varying the number of unmasked tokens during the generation process) and identify the optimal schedule as a function of a so-called information profile of the data distribution, thus allowing for a principled optimization of schedule sizes. We define methods directly as sampling algorithms and do not rely on classical derivations via time-reversed diffusion processes, leading to simple and transparent proofs.
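To make the sampling scheme concrete, here is a minimal sketch of MDM generation with a non-constant schedule. The `model` interface, `toy` vocabulary, and function names are hypothetical illustrations, not the paper's implementation: `schedule` lists how many tokens to unmask at each iteration, and tokens unmasked within the same iteration are drawn independently, which is exactly the factorized (conditional independence) approximation the bounds quantify.

```python
import random

def sample_mdm(model, seq_len, schedule, vocab, mask=None):
    """Sample a sequence from a masked diffusion model (sketch).

    `model(seq)` is a hypothetical interface returning, for each masked
    position, a probability distribution over `vocab`. `schedule` gives
    the number of tokens to unmask per iteration (summing to seq_len);
    a constant schedule recovers the standard MDM sampler, while a
    non-constant one adapts the per-step cost to the information profile.
    """
    seq = [mask] * seq_len
    masked = list(range(seq_len))
    random.shuffle(masked)  # uniformly random unmasking order
    for k in schedule:
        batch, masked = masked[:k], masked[k:]
        probs = model(seq)  # dict: masked position -> distribution
        # Factorized approximation: the k tokens of this iteration are
        # sampled independently given the currently revealed tokens.
        for pos in batch:
            dist = probs[pos]
            seq[pos] = random.choices(vocab, weights=[dist[v] for v in vocab])[0]
    return seq
```

With `schedule = [1] * seq_len` this degenerates to one-token-per-step (ARM-like, unbiased but slow); larger entries trade bias for fewer model calls, which is the computation-vs-accuracy trade-off the error bounds control.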
Problem

Research questions and friction points this paper is trying to address.

Analyzing computation-accuracy trade-off in masked diffusion models
Establishing dimension-independent error bounds for discrete generative models
Optimizing dynamic scheduling strategies using data distribution information profiles
Innovation

Methods, ideas, or system contributions that make the work stand out.

Factorized approximations reduce ARM computational cost
Error bounds independent of data dimensionality
Optimal non-constant schedules via information profiles
Hugo Lavenant
Bocconi University
Giacomo Zanella
Bocconi University, Department of Decision Sciences and BIDSA, Milan, Italy