Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing

📅 2025-10-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Parallel token sampling confronts a fundamental tension between conditional independence and high-confidence prediction. This paper proposes PUNT, a model-agnostic sampler for masked diffusion models that dynamically identifies subsets of tokens safe to update in parallel by combining approximate conditional independence tests with prediction confidence. Its core innovation is the first integration of conditional independence detection into discrete text-generation sampling, enabling hierarchical decoding (structural initialization followed by refinement) without model fine-tuning or additional training. On the IFEval benchmark, PUNT improves accuracy by up to 16% over baselines, shows pronounced advantages for long-sequence generation, is robust to hyperparameter variation, and improves the accuracy-efficiency trade-off.

📝 Abstract
Masked diffusion models (MDMs) offer a compelling alternative to autoregressive models (ARMs) for discrete text generation because they enable parallel token sampling, rather than sequential, left-to-right generation. This means potentially much faster inference. However, effective parallel sampling faces two competing requirements: (i) simultaneously updated tokens must be conditionally independent, and (ii) updates should prioritise high-confidence predictions. These goals conflict because high-confidence predictions often cluster and depend on each other, limiting opportunities for parallel updates. We present PUNT, a model-agnostic sampler that reconciles this trade-off. Our method identifies token dependencies and removes lower-confidence tokens from conflicting groups. This produces sets of indices for unmasking that satisfy both independence and confidence criteria. Our approach ensures improved parallel unmasking through approximate conditional independence testing. Our experiments show that PUNT delivers a superior trade-off between accuracy and compute when compared to other strong training-free baselines, especially for generation of longer sequences. On the IFEval benchmark, it achieves up to 16% higher accuracy over baseline methods, including sequential generation (one-by-one). These gains hold across different values of hyperparameters, mitigating the need for brittle hyperparameter tuning. Moreover, we observe that PUNT induces an emergent hierarchical generation strategy, where the model first establishes high-level paragraph structure before local refinement, suggesting a planning-like generation process that contributes to strong alignment performance.
Problem

Research questions and friction points this paper is trying to address.

Enabling parallel token sampling in masked diffusion models
Resolving conflict between conditional independence and prediction confidence
Improving the accuracy-compute trade-off for long-sequence generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies token dependencies via conditional independence testing
Removes low-confidence tokens from conflicting parallel groups
Enables hierarchical generation with paragraph structure planning
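The selection step described above can be sketched in code. This is a minimal illustrative toy, not the paper's implementation: the `toy_model` denoiser, the sum-of-neighbours dependence structure, the function names, and both thresholds are assumptions made for the example. The dependence check here is a crude probe (unmask one candidate, see whether another candidate's distribution shifts), standing in for the paper's approximate conditional independence test.

```python
# Hedged sketch of a PUNT-style parallel unmasking step (hypothetical code).
import itertools
import numpy as np

MASK = -1  # sentinel for a masked position

def toy_model(seq):
    """Stand-in for an MDM denoiser: returns per-position probability
    vectors over a 3-token vocabulary. A masked position's prediction
    depends on the sum of unmasked tokens in its local window, so
    nearby predictions can be dependent."""
    probs = np.full((len(seq), 3), 1.0 / 3.0)
    for i, tok in enumerate(seq):
        if tok == MASK:
            ctx = sum(t for t in seq[max(0, i - 1):i + 2] if t != MASK)
            p = np.ones(3)
            p[ctx % 3] += 2.0
            probs[i] = p / p.sum()
    return probs

def punt_step(seq, conf_threshold=0.5, dep_threshold=0.05):
    """Pick a set of masked positions to unmask in parallel:
    keep high-confidence predictions, then drop the lower-confidence
    member of any pair whose predictions are detected as dependent."""
    probs = toy_model(seq)
    masked = [i for i, t in enumerate(seq) if t == MASK]
    conf = {i: probs[i].max() for i in masked}
    # 1) confidence filter: keep only confident candidates
    cand = [i for i in masked if conf[i] >= conf_threshold]
    keep = set(cand)
    # 2) crude dependence probe (both directions): unmask one candidate
    #    with its argmax token and measure the shift in the other's
    #    predictive distribution.
    for i, j in itertools.combinations(sorted(cand), 2):
        for a, b in ((i, j), (j, i)):
            probe = list(seq)
            probe[b] = int(probs[b].argmax())
            shift = np.abs(toy_model(probe)[a] - probs[a]).sum()
            if shift > dep_threshold:  # dependent pair detected
                keep.discard(i if conf[i] < conf[j] else j)
                break
    # 3) unmask the surviving, mutually independent set in parallel
    new_seq = list(seq)
    for i in keep:
        new_seq[i] = int(probs[i].argmax())
    return new_seq, sorted(keep)
```

In this toy, two masked positions that both neighbour an already-unmasked token are conditionally independent given the context and can be unmasked together, while a masked position adjacent to another candidate gets pruned from the parallel set, mirroring the confidence-plus-independence criterion described above.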
Authors
Iskander Azangulov, Department of Statistics, University of Oxford
Teodora Pandeva, Microsoft Research (identifiable representation learning; sequential testing; nonparametric inference)
Niranjani Prasad, Microsoft Research, Cambridge
Javier Zazo, Microsoft Research, Cambridge
Sushrut Karmalkar, Microsoft Research (algorithms; complexity; regression; robustness)