🤖 AI Summary
This work addresses the limitations of existing theoretical analyses of the Euler and First-Hitting Sampler (FHS) methods in masked diffusion models, which are conducted in Kullback-Leibler (KL) divergence, rely on overly strong assumptions on score estimation, and establish no convergence guarantees for FHS at all. The authors propose a convergence analysis framework grounded in total variation (TV) distance, built on a TV error decomposition along the sampling trajectory and a path-decoupling technique. This approach yields the first tight dimension- and accuracy-dependent upper and lower bounds for the Euler method, and rigorously proves that the sampling error of FHS is determined solely by the score estimation error, a guarantee shown to be tight via a matching lower bound. By substantially weakening the assumptions on score estimation and eliminating the reliance on surrogate initialization, the framework provides sharper and more general theoretical guarantees for these two prominent classes of samplers.
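Schematically (in illustrative notation, not the paper's exact statement), the decomposition separates the two error sources for a sampler output law $\hat q$ run for $N$ steps in dimension $d$:

$$
\mathrm{TV}\!\left(p_{\mathrm{data}},\,\hat q\right)\;\lesssim\;\underbrace{\varepsilon_{\mathrm{disc}}(d,N)}_{\text{time discretization}}\;+\;\underbrace{\varepsilon_{\mathrm{score}}}_{\text{score estimation}},
$$

where the symbols $\varepsilon_{\mathrm{disc}}$ and $\varepsilon_{\mathrm{score}}$ are placeholders for the two terms described in the summary. For the Euler method both terms are bounded (and the first is shown tight in $d$ and $\varepsilon$); for FHS the analysis shows the discretization term vanishes, leaving only the score-estimation term.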
📝 Abstract
Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, with masked (absorbing-rate) variants emerging as competitive alternatives to autoregressive models. Among existing samplers, the Euler method remains the standard choice in many applications, and, more recently, the First-Hitting Sampler (FHS) has shown considerable promise for masked diffusion models. Despite their practical success, the theoretical understanding of these samplers remains limited. Existing analyses are conducted in Kullback-Leibler (KL) divergence, which often yields loose parameter dependencies and requires strong assumptions on score estimation. Moreover, these guarantees do not cover the recently developed, high-performance FHS sampler. In this work, we first develop a direct total-variation (TV) analysis of the Euler method that overcomes these limitations. Our results relax the assumptions on score estimation, improve the parameter dependencies, and establish convergence guarantees without requiring any surrogate initialization. For the same setting, we also provide the first convergence lower bound for the Euler sampler, establishing tightness with respect to both the data dimension $d$ and the target accuracy $\varepsilon$. Finally, we analyze the FHS sampler and show that it incurs no sampling error beyond that induced by score estimation, a guarantee we prove tight via a matching lower bound. Overall, our analysis introduces a direct TV-based error decomposition along the CTMC trajectory and a decoupling-based path-wise analysis of FHS, both of which may be of independent interest.
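To make the contrast between the two sampler families concrete, below is a minimal, hypothetical Python sketch of both for a masked diffusion model. It assumes the linear schedule $\alpha_t = 1 - t$ (under which the reverse unmasking probabilities and first-hitting times have simple closed forms) and uses a placeholder uniform `denoiser` in place of the learned network; it illustrates the sampler structure only, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): masked diffusion sampling over a
# toy vocabulary, with the linear schedule alpha_t = 1 - t. `denoiser` is a
# hypothetical stand-in for the learned score/denoising network.
import numpy as np

MASK = -1  # sentinel id for the absorbing (masked) state
V = 8      # toy vocabulary size
rng = np.random.default_rng(0)

def denoiser(x, t):
    """Placeholder for the learned model: per-position categorical
    probabilities over the V real tokens, given the partially masked x."""
    return np.full((len(x), V), 1.0 / V)

def euler_sampler(L, num_steps):
    """Euler discretization: march t from 1 to 0 on a fixed grid. Under
    alpha_t = 1 - t, a masked position is unmasked on a step t -> s with
    probability (t - s) / t; its value is drawn from the denoiser."""
    x = np.full(L, MASK)
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, s in zip(ts[:-1], ts[1:]):
        probs = denoiser(x, t)
        for i in np.flatnonzero(x == MASK):
            if rng.random() < (t - s) / t:
                x[i] = rng.choice(V, p=probs[i])
    return x

def first_hitting_sampler(L):
    """First-Hitting Sampler: with n masked tokens at time t, the next
    unmasking time is s = t * U^(1/n), U ~ Unif(0,1) (closed form under
    alpha_t = 1 - t); one uniformly chosen masked position is then revealed.
    There is no time grid, hence no discretization error -- only the
    score/denoiser error remains."""
    x = np.full(L, MASK)
    t = 1.0
    while (x == MASK).any():
        n = int((x == MASK).sum())
        t = t * rng.random() ** (1.0 / n)        # exact first-hitting time
        i = rng.choice(np.flatnonzero(x == MASK))
        x[i] = rng.choice(V, p=denoiser(x, t)[i])
    return x

print("Euler:", euler_sampler(L=10, num_steps=64))
print("FHS:  ", first_hitting_sampler(L=10))
```

The structural difference mirrors the theory: the Euler sampler approximates the reverse CTMC on a grid of `num_steps` points (incurring a discretization error that the paper bounds tightly in $d$ and $\varepsilon$), while FHS simulates the jump times of the reverse chain exactly, so with an exact denoiser it would sample the model distribution without additional error.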