🤖 AI Summary
Diffusion-based large language models (dLLMs) face severe memory pressure during production inference, driven primarily by massive logit tensors and highly volatile memory demands across the "Refresh" and "Reuse" phases of the diffusion process. Existing approaches focus narrowly on kernel-level optimizations and lack an end-to-end serving framework tailored to dLLMs' intrinsic memory dynamics.
Method: We propose the first dLLM-specific inference system, integrating three co-designed techniques: (1) Logit-Aware Activation Budgeting, (2) Phase-Multiplexed Scheduling, and (3) Head-Centric Sparse Attention, which maps algorithmic sparsity directly onto hardware acceleration. We further introduce memory-aware tensor decomposition and a logically-physically decoupled sparse mechanism.
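To make the memory-aware tensor decomposition idea concrete, here is a minimal hypothetical sketch (not the paper's actual implementation): rather than materializing the full `(seq_len, vocab)` logit tensor at once, logits are computed in row chunks and reduced immediately, so peak activation memory is bounded by the chunk size instead of the monolithic tensor.

```python
import numpy as np

def chunked_argmax_logits(hidden, w_vocab, chunk_rows=4):
    """Hypothetical sketch: compute per-token argmax over the vocabulary
    without ever holding the full (seq_len, vocab) logit matrix in memory."""
    seq_len = hidden.shape[0]
    out = np.empty(seq_len, dtype=np.int64)
    for start in range(0, seq_len, chunk_rows):
        end = min(start + chunk_rows, seq_len)
        # Only a (chunk_rows, vocab) slice of logits is ever live at once.
        logits = hidden[start:end] @ w_vocab
        out[start:end] = logits.argmax(axis=-1)
    return out

rng = np.random.default_rng(0)
hidden = rng.standard_normal((10, 8))    # (seq_len, d_model)
w_vocab = rng.standard_normal((8, 32))   # (d_model, vocab)
# The chunked result matches the monolithic computation exactly.
reference = (hidden @ w_vocab).argmax(axis=-1)
assert np.array_equal(chunked_argmax_logits(hidden, w_vocab), reference)
```

The function names and shapes here are illustrative assumptions; the point is only that a streaming reduce trades one transient peak for many small ones.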
Contribution/Results: Our system achieves 1.60×–1.81× higher throughput and reduces high-load tail latency by nearly 4× on RTX 4090 and L40S GPUs, while maintaining output quality and demonstrating cross-hardware scalability.
📝 Abstract
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to Autoregressive Models (ARMs), utilizing parallel decoding to overcome sequential bottlenecks. However, existing research focuses primarily on kernel-level optimizations, lacking a holistic serving framework that addresses the unique memory dynamics of diffusion processes in production. We identify a critical "memory footprint crisis" specific to dLLMs, driven by monolithic logit tensors and the severe resource oscillation between compute-bound "Refresh" phases and bandwidth-bound "Reuse" phases. To bridge this gap, we present dLLM-Serve, an efficient dLLM serving system that co-optimizes memory footprint, computational scheduling, and generation quality. dLLM-Serve introduces Logit-Aware Activation Budgeting to decompose transient tensor peaks, a Phase-Multiplexed Scheduler to interleave heterogeneous request phases, and Head-Centric Sparse Attention to decouple logical sparsity from physical storage. We evaluate dLLM-Serve on diverse workloads (LiveBench, Burst, OSC) and GPUs (RTX 4090, L40S). Relative to the state-of-the-art baseline, dLLM-Serve improves throughput by 1.61×–1.81× on the consumer-grade RTX 4090 and 1.60×–1.74× on the server-grade NVIDIA L40S, while reducing tail latency by nearly 4× under heavy contention. dLLM-Serve establishes the first blueprint for scalable dLLM inference, converting theoretical algorithmic sparsity into tangible wall-clock acceleration across heterogeneous hardware.
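The idea of decoupling logical sparsity from physical storage can be sketched as follows. This is a hypothetical illustration, not the paper's kernel: the KV cache stays dense and contiguous (physical layout), while each attention head gathers only its own selected token indices (logical view) before attending.

```python
import numpy as np

def head_sparse_attention(q, k_cache, v_cache, head_indices):
    """Hypothetical sketch of head-centric sparsity.
    q:            (heads, d)        query for the current step
    k_cache/v_cache: (seq, heads, d) dense, contiguous physical storage
    head_indices: per-head arrays of selected token positions (logical view)
    """
    heads, d = q.shape
    out = np.empty_like(q)
    for h in range(heads):
        idx = head_indices[h]          # logical selection, per head
        k = k_cache[idx, h]            # gather from the dense cache
        v = v_cache[idx, h]
        scores = (k @ q[h]) / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()                   # softmax over selected tokens only
        out[h] = w @ v
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4))
k_cache = rng.standard_normal((6, 2, 4))
v_cache = rng.standard_normal((6, 2, 4))
# With a single selected token, the softmax weight is 1.0, so each head's
# output is exactly that token's value vector.
out = head_sparse_attention(q, k_cache, v_cache,
                            [np.array([3]), np.array([0])])
assert np.allclose(out[0], v_cache[3, 0])
assert np.allclose(out[1], v_cache[0, 1])
```

The per-head gather keeps the physical cache layout untouched, which is what lets the same storage serve heads with very different sparsity patterns.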