Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion-based large language models (dLLMs) suffer from severe inference inefficiency: bidirectional attention requires frequent KV-cache refreshes, the prefill and decoding phases are interleaved, and fixed response lengths induce redundant computation. Method: This paper proposes a dual-boundary collaborative acceleration framework featuring (i) a phase-aware computational optimization mechanism that combines adaptive response-length prediction with a dLLM-specific jump-share speculative decoding method to reduce prefill overhead and sharply cut the number of decoding iterations; and (ii) an arithmetic-intensity-driven KV-cache optimization strategy. Results: Experiments demonstrate that, while preserving model accuracy, the framework achieves 46–162× and 2.63–6.30× inference speedups over baseline dLLMs and Fast-dLLM, respectively, overcoming key practical inference bottlenecks in dLLM deployment.

📝 Abstract
Diffusion-based large language models (dLLMs) have recently gained significant attention for their exceptional performance and inherent potential for parallel decoding. Existing frameworks further enhance their inference efficiency by enabling KV caching. However, the bidirectional attention mechanism necessitates periodic cache refreshes that interleave the prefill and decoding phases; both contribute substantial inference cost and constrain the achievable speedup. Inspired by the heterogeneous arithmetic intensity of the prefill and decoding phases, we propose ODB-dLLM, a framework that orchestrates dual boundaries to accelerate dLLM inference. In the prefill phase, we find that the predefined fixed response length introduces heavy yet redundant computational overhead, which hurts efficiency. To alleviate this, ODB-dLLM incorporates an adaptive length prediction mechanism that progressively reduces prefill overhead and unnecessary computation. In the decoding phase, we analyze the computational characteristics of dLLMs and propose a dLLM-specific jump-share speculative decoding method that enhances efficiency by reducing the number of decoding iterations. Experimental results demonstrate that ODB-dLLM achieves 46–162× and 2.63–6.30× speedups over the baseline dLLM and Fast-dLLM, respectively, while simultaneously mitigating the accuracy degradation seen in existing acceleration frameworks.
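The abstract's motivating observation, that prefill and decoding have heterogeneous arithmetic intensity, can be illustrated with a back-of-the-envelope roofline-style calculation. The sketch below is not from the paper; the shapes and byte counts are illustrative assumptions for a single fp16 weight matrix applied to a batch of tokens.

```python
# Arithmetic intensity (AI) = FLOPs / bytes moved.
# For a (d, d) fp16 weight matrix applied to n tokens:
#   FLOPs ~= 2 * n * d * d                        (one GEMM)
#   bytes ~= 2 * d * d + 2 * (2 * n * d)          (weights + in/out activations)
# Illustrative model only; real kernels differ in detail.

def arithmetic_intensity(n_tokens: int, d_model: int) -> float:
    flops = 2 * n_tokens * d_model * d_model
    bytes_moved = 2 * d_model * d_model + 2 * 2 * n_tokens * d_model
    return flops / bytes_moved

prefill_ai = arithmetic_intensity(n_tokens=1024, d_model=4096)  # many tokens
decode_ai = arithmetic_intensity(n_tokens=1, d_model=4096)      # one token

# Prefill reuses each weight across many tokens (high AI, compute-bound);
# decoding streams all weights for a single token (AI near 1, memory-bound).
print(f"prefill AI ~ {prefill_ai:.1f} FLOPs/byte, decode AI ~ {decode_ai:.2f}")
```

Under this toy model the prefill GEMM lands in the compute-bound regime while the decode GEMM is memory-bound, which is why the two phases benefit from different optimizations.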
Problem

Research questions and friction points this paper is trying to address.

Accelerates diffusion language models by reducing redundant prefill computation
Optimizes decoding phase with a jump-share speculative decoding method
Mitigates accuracy degradation while achieving significant inference speedups
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive length prediction reduces prefill overhead
Jump-share speculative decoding cuts decoding iterations
Orchestrates the prefill/decode dual boundaries according to their heterogeneous arithmetic intensity
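To see why cutting decoding iterations dominates the reported speedup, consider a toy cost model: each denoising iteration is one full forward pass, so committing more tokens per iteration reduces iterations roughly proportionally. This is a generic illustration, not the paper's jump-share method, and the numbers are assumptions.

```python
# Toy model: a response of resp_len tokens, where each decoding iteration
# commits tokens_per_iter tokens (1 for sequential decoding, k for a
# speculative/parallel scheme that verifies k candidates per pass).

def decoding_iterations(resp_len: int, tokens_per_iter: int) -> int:
    # Ceiling division: iterations needed to commit all resp_len tokens.
    return -(-resp_len // tokens_per_iter)

baseline = decoding_iterations(resp_len=512, tokens_per_iter=1)     # 512 passes
speculative = decoding_iterations(resp_len=512, tokens_per_iter=8)  # 64 passes
print(baseline, speculative, baseline / speculative)  # 512 64 8.0
```

In this model an 8-token-per-iteration scheme cuts forward passes 8-fold; combined with shorter adaptive response lengths on the prefill side, such multiplicative savings are consistent with the order-of-magnitude speedups the paper reports.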
Authors
Linye Wei (Peking University, Efficient AI System & Accelerator)
Wenjue Chen (Peking University)
Pingzhi Tang (Peking University)
Xiaotian Guo (Peking University)
Le Ye (Peking University)
Runsheng Wang (Peking University)
Meng Li (Peking University)