Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion-based large language models (dLLMs) suffer from severe inference inefficiency: bidirectional attention requires frequent KV-cache refreshes, the prefill and decoding phases are interleaved, and fixed response lengths induce redundant computation. Method: This paper proposes a dual-boundary collaborative acceleration framework featuring (i) a phase-aware computational optimization mechanism that combines adaptive response-length prediction with a dLLM-specific jump-share speculative decoding method to reduce prefill overhead and sharply cut the number of decoding iterations; and (ii) an arithmetic-intensity-driven KV-cache optimization strategy. Results: Experiments demonstrate that, while preserving model accuracy, the framework achieves 46–162× and 2.63–6.30× inference speedups over baseline dLLMs and Fast-dLLM, respectively, overcoming key practical inference bottlenecks in dLLM deployment.

📝 Abstract
Diffusion-based large language models (dLLMs) have recently gained significant attention for their exceptional performance and inherent potential for parallel decoding. Existing frameworks further enhance their inference efficiency by enabling KV caching. However, the bidirectional attention mechanism necessitates periodic cache refreshes that interleave the prefill and decoding phases; both contribute substantial inference cost and constrain the achievable speedup. Inspired by the heterogeneous arithmetic intensity of the prefill and decoding phases, we propose ODB-dLLM, a framework that orchestrates dual boundaries to accelerate dLLM inference. In the prefill phase, we find that the predefined fixed response length introduces heavy yet redundant computational overhead, which hurts efficiency. To alleviate this, ODB-dLLM incorporates an adaptive length prediction mechanism that progressively reduces prefill overhead and unnecessary computation. In the decoding phase, we analyze the computational characteristics of dLLMs and propose a dLLM-specific jump-share speculative decoding method that enhances efficiency by reducing the number of decoding iterations. Experimental results demonstrate that ODB-dLLM achieves 46–162× and 2.63–6.30× speedups over the baseline dLLM and Fast-dLLM, respectively, while simultaneously mitigating the accuracy degradation seen in existing acceleration frameworks.
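The abstract's motivating observation, that prefill and decoding have heterogeneous arithmetic intensity, can be illustrated with a back-of-the-envelope roofline-style calculation. The sketch below is not from the paper; the shapes and byte counts are illustrative assumptions for a single fp16 weight matrix applied to a batch of tokens.

```python
# Arithmetic intensity (AI) = FLOPs / bytes moved.
# For a (d, d) fp16 weight matrix applied to n tokens:
#   FLOPs ~= 2 * n * d * d                        (one GEMM)
#   bytes ~= 2 * d * d + 2 * (2 * n * d)          (weights + in/out activations)
# Illustrative model only; real kernels differ in detail.

def arithmetic_intensity(n_tokens: int, d_model: int) -> float:
    flops = 2 * n_tokens * d_model * d_model
    bytes_moved = 2 * d_model * d_model + 2 * 2 * n_tokens * d_model
    return flops / bytes_moved

prefill_ai = arithmetic_intensity(n_tokens=1024, d_model=4096)  # many tokens
decode_ai = arithmetic_intensity(n_tokens=1, d_model=4096)      # one token

# Prefill reuses each weight across many tokens (high AI, compute-bound);
# decoding streams all weights for a single token (AI near 1, memory-bound).
print(f"prefill AI ~ {prefill_ai:.1f} FLOPs/byte, decode AI ~ {decode_ai:.2f}")
```

Under this toy model the prefill GEMM lands in the compute-bound regime while the decode GEMM is memory-bound, which is why the two phases benefit from different optimizations.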
Problem

Research questions and friction points this paper is trying to address.

Accelerates diffusion language models by reducing redundant prefill computation
Optimizes decoding phase with a jump-share speculative decoding method
Mitigates accuracy degradation while achieving significant inference speedups
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive length prediction reduces prefill overhead
Jump-share speculative decoding cuts decoding iterations
Orchestrates the prefill/decode dual boundaries according to their heterogeneous arithmetic intensity
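To see why cutting decoding iterations dominates the reported speedup, consider a toy cost model: each denoising iteration is one full forward pass, so committing more tokens per iteration reduces iterations roughly proportionally. This is a generic illustration, not the paper's jump-share method, and the numbers are assumptions.

```python
# Toy model: a response of resp_len tokens, where each decoding iteration
# commits tokens_per_iter tokens (1 for sequential decoding, k for a
# speculative/parallel scheme that verifies k candidates per pass).

def decoding_iterations(resp_len: int, tokens_per_iter: int) -> int:
    # Ceiling division: iterations needed to commit all resp_len tokens.
    return -(-resp_len // tokens_per_iter)

baseline = decoding_iterations(resp_len=512, tokens_per_iter=1)     # 512 passes
speculative = decoding_iterations(resp_len=512, tokens_per_iter=8)  # 64 passes
print(baseline, speculative, baseline / speculative)  # 512 64 8.0
```

In this model an 8-token-per-iteration scheme cuts forward passes 8-fold; combined with shorter adaptive response lengths on the prefill side, such multiplicative savings are consistent with the order-of-magnitude speedups the paper reports.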
Authors
Linye Wei (Peking University, Efficient AI System & Accelerator)
Wenjue Chen (Peking University)
Pingzhi Tang (Peking University)
Xiaotian Guo (Peking University)
Le Ye (Peking University)
Runsheng Wang (Peking University)
Meng Li (Peking University)