Accelerating Diffusion LLM Inference via Local Determinism Propagation

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion large language models (dLLMs) face a fundamental trade-off between output quality and inference speed: conservative decoding strategies (e.g., greedy decoding) ensure high fidelity but require many refinement steps, resulting in high decoding latency and low throughput. To address this, the authors propose LocalLeap, a training-free, adaptive parallel decoding framework. LocalLeap introduces a *local deterministic anchor guidance* mechanism that identifies high-confidence tokens for early commitment, and combines it with *local relaxed parallel decoding* under *spatial consistency decay modeling* to suppress error propagation. Departing from conventional single-token serial decoding, LocalLeap achieves near-lossless output quality while delivering a 6.94× throughput improvement and reducing the number of decoding steps to just 14.2% of the baseline. This significantly improves dLLM inference efficiency without architectural or training modifications.

📝 Abstract
Diffusion large language models (dLLMs) represent a significant advancement in text generation, offering parallel token decoding capabilities. However, existing open-source implementations suffer from quality-speed trade-offs that impede their practical deployment. Conservative sampling strategies typically decode only the most confident token per step to ensure quality (i.e., greedy decoding), at the cost of inference efficiency due to repeated redundant refinement iterations -- a phenomenon we term delayed decoding. Through systematic analysis of dLLM decoding dynamics, we characterize this delayed decoding behavior and propose a training-free adaptive parallel decoding strategy, named LocalLeap, to address these inefficiencies. LocalLeap is built on two fundamental empirical principles: local determinism propagation centered on high-confidence anchors and progressive spatial consistency decay. By applying these principles, LocalLeap identifies anchors and performs localized relaxed parallel decoding within bounded neighborhoods, achieving substantial inference step reduction through early commitment of already-determined tokens without compromising output quality. Comprehensive evaluation on various benchmarks demonstrates that LocalLeap achieves 6.94× throughput improvements and reduces decoding steps to just 14.2% of the original requirement, achieving these gains with negligible performance impact. The source codes are available at: https://github.com/friedrichor/LocalLeap.
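The abstract describes the core decision LocalLeap makes at each refinement step: treat high-confidence tokens as anchors, then commit nearby tokens under a threshold that tightens with distance from the anchor (the spatial consistency decay). The sketch below illustrates that selection logic only; the function name, threshold values, window size, and linear decay schedule are illustrative assumptions, not the paper's actual implementation.

```python
def select_tokens_to_commit(confidences, tau_anchor=0.9, tau_base=0.7,
                            decay=0.05, window=4):
    """Hypothetical sketch of anchor-guided relaxed parallel decoding.

    Tokens with confidence >= tau_anchor become anchors and commit
    immediately. Within `window` positions of an anchor, a token commits
    if its confidence clears a relaxed threshold that grows with distance
    from the anchor, modeling progressive spatial consistency decay.
    All parameter names and values here are illustrative assumptions.
    """
    n = len(confidences)
    commit = [c >= tau_anchor for c in confidences]  # anchors always commit
    anchors = [i for i, c in enumerate(confidences) if c >= tau_anchor]
    for a in anchors:
        for j in range(max(0, a - window), min(n, a + window + 1)):
            if commit[j]:
                continue
            dist = abs(j - a)
            # relaxed near the anchor, stricter as distance grows
            thresh = min(tau_anchor, tau_base + decay * dist)
            if confidences[j] >= thresh:
                commit[j] = True
    if not any(commit):
        # fall back to greedy: commit only the single most confident token
        commit[max(range(n), key=lambda i: confidences[i])] = True
    return commit
```

With no anchors present, the sketch degenerates to conservative one-token-per-step decoding, which matches the baseline behavior the paper is accelerating; when anchors exist, several neighboring tokens commit in the same step, which is where the step reduction comes from.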
Problem

Research questions and friction points this paper is trying to address.

Improving diffusion LLM inference efficiency
Reducing delayed decoding in parallel token generation
Maintaining quality while accelerating text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free adaptive parallel decoding strategy
Local determinism propagation around high-confidence anchors
Progressive spatial consistency decay for efficiency
Fanheng Kong
Northeastern University; Kuaishou Technology
Multimodal LLM, Multimodal Understanding
Jingyuan Zhang
Klear Team, Kuaishou Technology
Yahui Liu
Klear Team, Kuaishou Technology
Zirui Wu
Klear Team, Kuaishou Technology
Yu Tian
Klear Team, Kuaishou Technology
Victoria W.
Klear Team, Kuaishou Technology
Guorui Zhou
Unknown affiliation
Recommender System, Advertising, Artificial Intelligence, Machine Learning, NLP