Any-Order Flexible Length Masked Diffusion

πŸ“… 2025-08-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Masked diffusion models (MDMs) are constrained to fixed-length generation, limiting their applicability to flexible sequence modeling. To address this, we propose FlexMDMs, the first masked diffusion models supporting variable-length discrete sequence generation. FlexMDMs enable non-autoregressive, parallel inference over sequences of arbitrary length by dynamically inserting mask tokens and denoising them in arbitrary order. Theoretically, we generalize the stochastic interpolant framework to accommodate both length variation and arbitrary denoising schedules. Engineering-wise, FlexMDMs integrate seamlessly with pre-trained MDMs, requiring only lightweight fine-tuning for adaptation. Empirically, FlexMDMs achieve ~60% higher success rates on maze planning; after three days of fine-tuning LLaDA-8B, they significantly outperform baselines on GSM8K (58% → 67%) and code infilling (52% → 65%), demonstrating both efficacy and practicality.
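The generation loop the summary describes (dynamically inserting mask tokens, then unmasking them in arbitrary order) can be sketched as follows. This is a toy illustration only, not the authors' algorithm: `insert_fn` and `denoise_fn` are hypothetical stand-ins for the learned insertion and denoising components.

```python
import random

MASK = "<mask>"

def flexmdm_sample_sketch(denoise_fn, insert_fn, num_steps, seq=None):
    """Toy sketch of FlexMDM-style generation: each step may insert new
    mask tokens at arbitrary positions (changing the length), then unmask
    a subset of masked positions in arbitrary order."""
    seq = list(seq) if seq is not None else []
    for _ in range(num_steps):
        # 1) Length change: decide where to insert fresh mask tokens.
        for pos in sorted(insert_fn(seq), reverse=True):
            seq.insert(pos, MASK)
        # 2) Any-order denoising: unmask a random subset of masked slots.
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        for i in random.sample(masked, k=min(2, len(masked))):
            seq[i] = denoise_fn(seq, i)
    return seq

# Hypothetical stand-ins for the learned models (purely illustrative).
def toy_insert(seq):
    return [len(seq)] if len(seq) < 6 else []   # append a mask while short

def toy_denoise(seq, i):
    return f"tok{i}"                            # dummy token prediction

out = flexmdm_sample_sketch(toy_denoise, toy_insert, num_steps=8)
print(out)
```

The key contrast with a standard MDM is step 1: because masks can be inserted mid-generation, the final sequence length is determined by the model rather than fixed in advance.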

πŸ“ Abstract
Masked diffusion models (MDMs) have recently emerged as a promising alternative to autoregressive models over discrete domains. MDMs generate sequences in an any-order, parallel fashion, enabling fast inference and strong performance on non-causal tasks. However, a crucial limitation is that they do not support token insertions and are thus limited to fixed-length generations. To this end, we introduce Flexible Masked Diffusion Models (FlexMDMs), a discrete diffusion paradigm that can simultaneously model sequences of flexible length while provably retaining MDMs' flexibility of any-order inference. Grounded in an extension of the stochastic interpolant framework, FlexMDMs generate sequences by inserting mask tokens and unmasking them. Empirically, we show that FlexMDMs match MDMs in perplexity while modeling length statistics with much higher fidelity. On a synthetic maze planning task, they achieve $\approx 60\%$ higher success rate than MDM baselines. Finally, we show pretrained MDMs can easily be retrofitted into FlexMDMs: on 16 H100s, it takes only three days to fine-tune LLaDA-8B into a FlexMDM, achieving superior performance on math (GSM8K, $58\% \to 67\%$) and code infilling ($52\% \to 65\%$).
Problem

Research questions and friction points this paper is trying to address.

Overcoming fixed-length generation in masked diffusion models
Enabling token insertions for flexible sequence lengths
Retaining any-order inference flexibility while modeling length variability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flexible Masked Diffusion Models for variable length
Inserting mask tokens and unmasking them
Retrofitting pretrained MDMs into FlexMDMs via fine-tuning
πŸ”Ž Similar Papers
No similar papers found.
Jaeyeon Kim
Harvard University

Lee Cheuk-Kit
Harvard University, Kempner Institute

Carles Domingo-Enrich
Microsoft Research
machine learning, generative modeling

Yilun Du
Harvard University
Artificial Intelligence, Machine Learning, Robotics, Computer Vision

Sham Kakade
Harvard University, Kempner Institute

Timothy Ngotiaoco
Harvard University, Kempner Institute

Sitan Chen
Assistant Professor of Computer Science, Harvard University
theoretical computer science, generative modeling, quantum information, mathematics of data science

Michael Albergo
Harvard University, Kempner Institute, Institute for Artificial Intelligence and Fundamental Interactions, MIT