🤖 AI Summary
This work addresses the drop in generation quality that parallel masked diffusion language models suffer when they generate multiple tokens at once, which stems from a mismatch between their token-level training objective and the need for joint sequence consistency. The authors propose ME-DLM, a framework that adds lightweight post-editing steps after parallel diffusion has produced a complete sequence. These steps apply minimal edit operations (replacement, deletion, and insertion) conditioned on the full sequence, with training supervision derived from edit distance. Built on LLaDA, ME-DLM gains 11.6 points on HumanEval and 33.6 points on GSM8K while using one-eighth of the total diffusion steps, improving the quality and robustness of multi-token parallel generation.
📝 Abstract
Masked diffusion language models enable parallel token generation and offer improved decoding efficiency over autoregressive models. However, their performance degrades significantly when generating multiple tokens simultaneously, due to a mismatch between token-level training objectives and joint sequence consistency. In this paper, we propose ME-DLM, an edit-based refinement framework that augments diffusion generation with lightweight post-editing steps. After producing an initial complete response, the model refines it through minimal edit operations, including replacement, deletion, and insertion, conditioned on the full sequence. Training supervision is derived from edit distance, providing a deterministic signal under a fixed canonicalization scheme for learning minimal corrections. This approach encourages sequence-level consistency through globally conditioned edits while preserving the efficiency benefits of parallel diffusion decoding. Extensive experiments demonstrate that ME-DLM improves the quality and robustness of multi-token parallel generation. In particular, when built upon LLaDA, our method achieves consistent gains of 11.6 points on HumanEval and 33.6 points on GSM8K while using one-eighth of the total diffusion steps. Code is available at https://github.com/renhouxing/ME-DLM.
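To make the edit-distance supervision concrete, the following is a minimal sketch of how a minimal edit script between a draft token sequence and a reference could be derived. It is not taken from the ME-DLM code: the function names (`edit_script`, `apply_script`, `num_edits`) are hypothetical, and the tie-breaking order used in the backtrace (diagonal first, then delete, then insert) is only an assumed stand-in for the "fixed canonicalization scheme" the abstract mentions, chosen so that the same pair of sequences always yields the same script.

```python
from typing import List, Tuple

# One step of an edit script: (operation, token).
# Operations: "keep", "replace", "delete", "insert" (names are illustrative).
Step = Tuple[str, str]


def edit_script(draft: List[str], target: List[str]) -> List[Step]:
    """Return one minimal edit script turning `draft` into `target`.

    Standard Levenshtein dynamic programming, followed by a backtrace with
    a fixed preference order (diagonal, then delete, then insert). The
    preference order is an assumption made here for determinism.
    """
    n, m = len(draft), len(target)
    # dp[i][j] = edit distance between draft[:i] and target[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (draft[i - 1] != target[j - 1])
            dp[i][j] = min(diag, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    # Backtrace from the bottom-right corner, collecting steps in reverse.
    script: List[Step] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + (draft[i - 1] != target[j - 1])):
            op = "keep" if draft[i - 1] == target[j - 1] else "replace"
            script.append((op, target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            script.append(("delete", draft[i - 1]))
            i -= 1
        else:
            script.append(("insert", target[j - 1]))
            j -= 1
    script.reverse()
    return script


def apply_script(script: List[Step]) -> List[str]:
    """Rebuild the edited sequence: every non-deleted step emits its token."""
    return [tok for op, tok in script if op != "delete"]


def num_edits(script: List[Step]) -> int:
    """Count the steps that actually change the draft."""
    return sum(op != "keep" for op, _ in script)


if __name__ == "__main__":
    draft = "x = a + + b".split()
    target = "x = a + b".split()
    script = edit_script(draft, target)
    print(script)
    print(num_edits(script), apply_script(script) == target)
```

In a training setup along the lines the abstract describes, a script like this would supply the per-position labels for the replacement, deletion, and insertion operations; how ME-DLM actually encodes those labels is not specified in the abstract, so that part is left out of the sketch.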