Self-Speculative Masked Diffusions

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard masked diffusion models for discrete data generation suffer from inefficient sampling: their factorized predictions over masked positions force many neural network forward passes (function evaluations) to maintain sample quality. This work introduces self-speculative masked diffusions, a framework that overcomes the limitations of factorized modeling via non-factorized prediction and causal attention. Methodologically, it adapts the Transformer architecture by replacing the final non-causal attention mask with a causal one and integrating a model-integrated speculative sampling mechanism, enabling draft token generation and parallel verification within a single network. Evaluated on GPT-2-scale text modeling and protein sequence generation, the approach reduces forward passes by approximately 50% (a ~2x reduction) while preserving sample fidelity, establishing a more efficient and scalable paradigm for discrete diffusion modeling of structured sequences.

📝 Abstract
We present self-speculative masked diffusions, a new class of masked diffusion generative models for discrete data that require significantly fewer function evaluations to generate samples. Standard masked diffusion models predict factorized logits over currently masked positions. A number of masked positions are then sampled; however, the factorization approximation means that sampling too many positions in one go leads to poor sample quality. As a result, many simulation steps, and therefore neural network function evaluations, are required to generate high-quality data. We reduce the computational burden by generating non-factorized predictions over masked positions. This is achieved by modifying the final transformer attention mask from non-causal to causal, enabling draft token generation and parallel validation via a novel, model-integrated speculative sampling mechanism. This results in a non-factorized predictive distribution over masked positions in a single forward pass. We apply our method to GPT-2-scale text modelling and protein sequence generation, finding that we can achieve a ~2x reduction in the required number of network forward passes relative to standard masked diffusion models.
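The draft-then-verify step the abstract describes follows the standard speculative sampling accept/reject rule. Below is a minimal, self-contained sketch of that rule for a batch of drafted tokens; it is an illustration of generic speculative sampling, not the paper's model-integrated implementation, and all names (`speculative_accept`, `draft_probs`, `target_probs`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(draft_probs, target_probs, draft_tokens):
    """Verify drafted tokens against the target distribution.

    Token t at position i is accepted with probability
    min(1, p_target(t) / p_draft(t)); on the first rejection, a
    replacement is sampled from the residual distribution
    max(0, p_target - p_draft) and verification stops, so every
    accepted prefix is distributed exactly as under the target model.
    """
    accepted = []
    for i, t in enumerate(draft_tokens):
        p_d = draft_probs[i][t]
        p_t = target_probs[i][t]
        if rng.random() < min(1.0, p_t / p_d):
            accepted.append(int(t))
        else:
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted
```

When draft and target distributions coincide, every token is accepted, which is the best case the causal-mask drafting aims for: the closer the draft is to the target, the fewer forward passes are spent on rejections.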
Problem

Research questions and friction points this paper is trying to address.

Reducing function evaluations in masked diffusion models for discrete data
Improving sample quality by generating non-factorized predictions over masked positions
Accelerating generation of text and protein sequences with fewer network passes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal transformer mask enables draft generation
Model-integrated speculative sampling for validation
Non-factorized masked predictions reduce computational steps
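The first bullet hinges on the difference between a non-causal (bidirectional) attention mask and a causal one. A minimal additive-mask sketch of the two patterns is below; the function name and shapes are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def attention_mask(seq_len, causal):
    """Return an additive attention mask: 0 where attention is allowed,
    -inf where it is blocked (added to logits before softmax).

    Non-causal: every position attends to every other, giving factorized
    per-position predictions. Causal: position i attends only to j <= i,
    so one forward pass can define an autoregressive-style, non-factorized
    joint distribution over the masked positions.
    """
    if not causal:
        return np.zeros((seq_len, seq_len))
    mask = np.full((seq_len, seq_len), -np.inf)
    mask[np.tril_indices(seq_len)] = 0.0  # unblock lower triangle + diagonal
    return mask
```

Swapping only this mask in the final attention layer is what lets the same network both draft tokens and score them for verification.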