dParallel: Learnable Parallel Decoding for dLLMs

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion large language models (dLLMs) promise parallel decoding, yet existing open-source models still need a number of decoding steps close to the output length to preserve generation quality, which erodes their efficiency advantage. This work proposes dParallel, a learnable parallel decoding method, and identifies the sequential convergence of certainty across masked tokens as the key bottleneck. To address it, the authors introduce certainty-forcing distillation, a training strategy that distills the model to follow its original sampling trajectories while forcing it to reach high certainty on masked tokens faster and in parallel. On LLaDA-8B-Instruct, dParallel reduces decoding steps on GSM8K from 256 to 30 (8.5× speedup) and on MBPP from 256 to 24 (10.5× speedup) without performance degradation. To the authors' knowledge, this is the first work to achieve high-accuracy parallel sampling in dLLMs within only tens of steps.

📝 Abstract

Diffusion large language models (dLLMs) have recently drawn considerable attention within the research community as a promising alternative to autoregressive generation, offering parallel token prediction and lower inference latency. Yet, their parallel decoding potential remains largely underexplored, as existing open-source models still require nearly token-length decoding steps to ensure performance. To address this, we introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling. We identify that the key bottleneck to parallel decoding arises from the sequential certainty convergence for masked tokens. Building on this insight, we introduce the core of our approach: certainty-forcing distillation, a novel training strategy that distills the model to follow its original sampling trajectories while enforcing it to achieve high certainty on masked tokens more rapidly and in parallel. Extensive experiments across various benchmarks demonstrate that our method can dramatically reduce the number of decoding steps while maintaining performance. When applied to the LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP benchmark, it cuts decoding steps from 256 to 24, resulting in a 10.5x speedup while maintaining accuracy. Our code is available at https://github.com/czg1225/dParallel
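The bottleneck described in the abstract can be illustrated with a toy decoder. The sketch below is not the authors' implementation: it assumes a simple confidence-threshold rule (commit every masked position whose confidence reaches a threshold, or the single most confident one if none does), and `MASK`, `threshold_decode`, `sequential_model`, and `parallel_model` are hypothetical names invented for this illustration. It only shows why sequential certainty convergence forces roughly one step per token, while parallel certainty allows far fewer steps.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical placeholder for a masked position
Prediction = Tuple[int, float]  # (predicted token id, confidence)
Sequence = List[Optional[int]]


def threshold_decode(
    predict: Callable[[Sequence], List[Prediction]],
    length: int,
    threshold: float = 0.9,
) -> Tuple[Sequence, int]:
    """Unmask tokens in parallel whenever their confidence reaches the threshold."""
    seq: Sequence = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        preds = predict(seq)  # one "forward pass" over all positions
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        accept = [i for i in masked if preds[i][1] >= threshold]
        if not accept:
            # Fall back to the single most confident position to guarantee progress.
            accept = [max(masked, key=lambda i: preds[i][1])]
        for i in accept:
            seq[i] = preds[i][0]
        steps += 1
    return seq, steps


def sequential_model(seq: Sequence) -> List[Prediction]:
    """Toy model: only the left-most masked position is ever confident."""
    first = next((i for i, tok in enumerate(seq) if tok is MASK), len(seq))
    return [(i, 0.95 if i == first else 0.3) for i in range(len(seq))]


def parallel_model(seq: Sequence) -> List[Prediction]:
    """Toy model: every position is confident at the same time."""
    return [(i, 0.95) for i in range(len(seq))]


if __name__ == "__main__":
    print(threshold_decode(sequential_model, 8)[1])  # one step per token
    print(threshold_decode(parallel_model, 8)[1])    # a single step
```

In this toy setting the decoding rule is identical in both runs; only the shape of the model's confidence changes the step count, which is the property that a training-time method such as certainty-forcing distillation targets.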

Problem

Research questions and friction points this paper is trying to address.

Unlocking parallel decoding potential in diffusion language models

Overcoming the sequential convergence of certainty across masked tokens

Maintaining performance while dramatically cutting decoding steps

Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable parallel decoding for diffusion language models

Certainty-forcing distillation accelerates certainty convergence on masked tokens

Reduces decoding steps while maintaining model performance
