DMax: Aggressive Parallel Decoding for dLLMs

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the degradation in generation quality that diffusion language models suffer under aggressive parallel decoding, which stems from error accumulation. The authors propose a novel paradigm that reframes decoding as a progressive self-optimization process mapping from mask embeddings to token embeddings. Key innovations include an On-Policy Uniform Training strategy, a Soft Parallel Decoding mechanism, and iterative self-correction in embedding space via mask-token interpolation representations. This approach substantially improves parallel efficiency, achieving TPF of 5.47 on GSM8K and 5.86 on MBPP while preserving high accuracy, and attains a throughput of 1,338 tokens per second (TPS) in single-batch inference.
📝 Abstract
We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revision in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax
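The Soft Parallel Decoding idea in the abstract, representing each undecoded position as an interpolation between the mask embedding and the predicted token embedding, can be sketched in a few lines. Everything below (the dimensions, the `soft_state` helper, the confidence values) is a hypothetical illustration under assumed shapes, not the authors' implementation:

```python
import numpy as np

# Toy sketch of mask-token interpolation (hypothetical names and shapes,
# not the DMax code). Each undecoded position is kept as a convex
# combination of the mask embedding and the currently predicted token
# embedding, so the state can still be revised on later forward passes.

rng = np.random.default_rng(0)
d = 8                              # embedding dimension (toy value)
mask_emb = rng.normal(size=d)      # embedding of the mask token
pred_emb = rng.normal(size=d)      # model's current token prediction


def soft_state(pred_emb: np.ndarray, mask_emb: np.ndarray, conf: float) -> np.ndarray:
    """Interpolated decoding state: approaches the token embedding as
    confidence grows, and stays near the mask embedding otherwise."""
    return conf * pred_emb + (1.0 - conf) * mask_emb


# As confidence rises across refinement steps, the state drifts from the
# mask embedding toward the predicted token embedding.
for conf in (0.0, 0.5, 1.0):
    state = soft_state(pred_emb, mask_emb, conf)
    dist = np.linalg.norm(state - pred_emb)
    print(f"conf={conf:.1f}  distance to token embedding={dist:.3f}")
```

At confidence 0 the state is exactly the mask embedding and at confidence 1 it is exactly the predicted token embedding; intermediate values keep the position "soft," which is what allows the iterative self-revision in embedding space that the abstract describes.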
Problem

Research questions and friction points this paper is trying to address.

diffusion language models, parallel decoding, error accumulation, generation quality, decoding efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion language models, parallel decoding, On-Policy Uniform Training, Soft Parallel Decoding, self-refinement