🤖 AI Summary
Diffusion-based large language models (dLLMs) face severe memory pressure during production inference, driven primarily by massive logit tensors and highly volatile memory demands across the "Refresh" and "Reuse" phases of the diffusion process. Existing approaches focus narrowly on kernel-level optimizations and lack an end-to-end serving framework tailored to dLLMs' intrinsic memory dynamics.
Method: We propose the first dLLM-specific inference system, integrating three co-designed techniques: (1) Logit-Aware Activation Budgeting, (2) Phase-Multiplexed Scheduling, and (3) Head-Centric Sparse Attention, which maps algorithmic sparsity directly onto hardware acceleration. We further introduce memory-aware tensor decomposition and a logically-physically decoupled sparse mechanism.
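To make the memory-aware tensor decomposition idea concrete, here is a minimal hypothetical sketch (not the paper's actual implementation): rather than materializing the full `(seq_len, vocab)` logit tensor at once, logits are computed in row chunks and reduced immediately, so peak activation memory is bounded by the chunk size instead of the monolithic tensor.

```python
import numpy as np

def chunked_argmax_logits(hidden, w_vocab, chunk_rows=4):
    """Hypothetical sketch: compute per-token argmax over the vocabulary
    without ever holding the full (seq_len, vocab) logit matrix in memory."""
    seq_len = hidden.shape[0]
    out = np.empty(seq_len, dtype=np.int64)
    for start in range(0, seq_len, chunk_rows):
        end = min(start + chunk_rows, seq_len)
        # Only a (chunk_rows, vocab) slice of logits is ever live at once.
        logits = hidden[start:end] @ w_vocab
        out[start:end] = logits.argmax(axis=-1)
    return out

rng = np.random.default_rng(0)
hidden = rng.standard_normal((10, 8))    # (seq_len, d_model)
w_vocab = rng.standard_normal((8, 32))   # (d_model, vocab)
# The chunked result matches the monolithic computation exactly.
reference = (hidden @ w_vocab).argmax(axis=-1)
assert np.array_equal(chunked_argmax_logits(hidden, w_vocab), reference)
```

The function names and shapes here are illustrative assumptions; the point is only that a streaming reduce trades one transient peak for many small ones.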
Contribution/Results: Our system achieves 1.60×–1.81× higher throughput and reduces high-load tail latency by nearly 4× on RTX 4090 and L40S GPUs, while maintaining output quality and demonstrating cross-hardware scalability.
📝 Abstract
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to Autoregressive Models (ARMs), utilizing parallel decoding to overcome sequential bottlenecks. However, existing research focuses primarily on kernel-level optimizations, lacking a holistic serving framework that addresses the unique memory dynamics of diffusion processes in production. We identify a critical "memory footprint crisis" specific to dLLMs, driven by monolithic logit tensors and the severe resource oscillation between compute-bound "Refresh" phases and bandwidth-bound "Reuse" phases. To bridge this gap, we present dLLM-Serve, an efficient dLLM serving system that co-optimizes memory footprint, computational scheduling, and generation quality. dLLM-Serve introduces Logit-Aware Activation Budgeting to decompose transient tensor peaks, a Phase-Multiplexed Scheduler to interleave heterogeneous request phases, and Head-Centric Sparse Attention to decouple logical sparsity from physical storage. We evaluate dLLM-Serve on diverse workloads (LiveBench, Burst, OSC) and GPUs (RTX 4090, L40S). Relative to the state-of-the-art baseline, dLLM-Serve improves throughput by 1.61×–1.81× on the consumer-grade RTX 4090 and 1.60×–1.74× on the server-grade NVIDIA L40S, while reducing tail latency by nearly 4× under heavy contention. dLLM-Serve establishes the first blueprint for scalable dLLM inference, converting theoretical algorithmic sparsity into tangible wall-clock acceleration across heterogeneous hardware.
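The idea of decoupling logical sparsity from physical storage can be sketched as follows. This is a hypothetical illustration, not the paper's kernel: the KV cache stays dense and contiguous (physical layout), while each attention head gathers only its own selected token indices (logical view) before attending.

```python
import numpy as np

def head_sparse_attention(q, k_cache, v_cache, head_indices):
    """Hypothetical sketch of head-centric sparsity.
    q:            (heads, d)        query for the current step
    k_cache/v_cache: (seq, heads, d) dense, contiguous physical storage
    head_indices: per-head arrays of selected token positions (logical view)
    """
    heads, d = q.shape
    out = np.empty_like(q)
    for h in range(heads):
        idx = head_indices[h]          # logical selection, per head
        k = k_cache[idx, h]            # gather from the dense cache
        v = v_cache[idx, h]
        scores = (k @ q[h]) / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()                   # softmax over selected tokens only
        out[h] = w @ v
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4))
k_cache = rng.standard_normal((6, 2, 4))
v_cache = rng.standard_normal((6, 2, 4))
# With a single selected token, the softmax weight is 1.0, so each head's
# output is exactly that token's value vector.
out = head_sparse_attention(q, k_cache, v_cache,
                            [np.array([3]), np.array([0])])
assert np.allclose(out[0], v_cache[3, 0])
assert np.allclose(out[1], v_cache[0, 1])
```

The per-head gather keeps the physical cache layout untouched, which is what lets the same storage serve heads with very different sparsity patterns.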