🤖 AI Summary
Masked diffusion models suffer from low sampling efficiency and limited theoretical interpretability, which hinders practical deployment. This paper uncovers an implicit temperature-based sampling mechanism in MaskGIT and proposes an analytically tractable “moment sampler” that follows a “choose-then-sample” paradigm: unmasking positions are selected first, and tokens are sampled afterwards. It further introduces a partial caching strategy and an adaptive, non-uniform unmasking schedule grounded in the exploration-exploitation trade-off, which lets the denoising trajectory adapt during sampling. Built on the Transformer architecture, the method handles position selection and token generation within a single modeling framework. Experiments on image and text generation show substantial acceleration (up to 3.2× fewer sampling steps) while preserving or improving generation quality. The analysis also provides a theoretical framework for understanding and designing efficient masked diffusion samplers.
📝 Abstract
Masked diffusion models have shown promising performance in generating high-quality samples in a wide range of domains, but accelerating their sampling process remains relatively underexplored. To investigate efficient samplers for masked diffusion, this paper theoretically analyzes the MaskGIT sampler for image modeling, revealing its implicit temperature sampling mechanism. Through this analysis, we introduce the "moment sampler," an asymptotically equivalent but more tractable and interpretable alternative to MaskGIT, which employs a "choose-then-sample" approach by selecting unmasking positions before sampling tokens. In addition, we improve the efficiency of choose-then-sample algorithms through two key innovations: a partial caching technique for transformers that approximates longer sampling trajectories without proportional computational cost, and a hybrid approach formalizing the exploration-exploitation trade-off in adaptive unmasking. Experiments in image and text domains demonstrate our theory as well as the efficiency of our proposed methods, advancing both theoretical understanding and practical implementation of masked diffusion samplers.
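The "choose-then-sample" idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's moment sampler: the position-scoring rule (largest per-position probability), the `MASK` sentinel, and the function name `choose_then_sample_step` are all assumptions made for illustration, and `probs` stands in for the output of a trained model. The sketch only shows the order of operations: pick which masked positions to reveal, then sample a token at each of them.

```python
import random

MASK = -1  # hypothetical sentinel marking a still-masked position


def choose_then_sample_step(tokens, probs, k, rng):
    """Unmask up to k positions: choose positions first, then sample tokens.

    tokens: list of ints, MASK where the position is still masked.
    probs:  list of per-position categorical distributions (a stand-in
            for the model's predicted token probabilities).
    """
    masked = [i for i, t in enumerate(tokens) if t == MASK]

    # Step 1 (choose): rank masked positions by an illustrative confidence
    # score, the largest predicted probability. This rule is an assumption,
    # not the selection rule analyzed in the paper.
    chosen = sorted(masked, key=lambda i: max(probs[i]), reverse=True)[:k]

    # Step 2 (sample): draw one token at each chosen position from its
    # categorical distribution; all other positions are left unchanged.
    out = list(tokens)
    for i in chosen:
        vocab = range(len(probs[i]))
        out[i] = rng.choices(vocab, weights=probs[i])[0]
    return out


rng = random.Random(0)
tokens = [MASK, 3, MASK, MASK]
probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]
new_tokens = choose_then_sample_step(tokens, probs, k=2, rng=rng)
```

With these toy inputs, positions 0 and 3 have the highest scores and are unmasked, while position 2 stays masked for a later step. A full sampler would repeat this step, re-querying the model between steps, until no masked positions remain.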