🤖 AI Summary
This work addresses the inefficiency of sampling in diffusion-based large language models (dLLMs) on conventional GEMM-centric NPUs, where high memory overhead and irregular memory access patterns cause sampling latency to account for up to 70% of total inference delay. The study systematically identifies, for the first time, the essential non-GEMM instruction set required for dLLM sampling and proposes a sampling-oriented NPU microarchitecture that departs from the GEMM-centric paradigm. By integrating lightweight vector primitives, in-place memory reuse, and a decoupled mixed-precision memory hierarchy, the design significantly improves sampling efficiency. Evaluated at an equivalent process technology node, the proposed architecture achieves up to a 2.53× speedup over an NVIDIA RTX A6000 GPU. To ensure reproducibility and functional correctness, the authors open-source a cycle-accurate simulator and an RTL implementation.
📝 Abstract
Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase displays fundamentally different characteristics from GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total model inference latency, primarily due to substantial memory loads and writes from vocabulary-wide logits, reduction-based token selection, and iterative masked updates. These processes demand large on-chip SRAM and involve irregular memory accesses that conventional NPUs struggle to handle efficiently. To address this, we identify a set of critical instructions that an NPU architecture must specifically optimize for dLLM sampling. Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy. Together, these optimizations deliver up to a 2.53× speedup over the NVIDIA RTX A6000 GPU at an equivalent process technology node. We also open-source our cycle-accurate simulation and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations.
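To make the abstract's workload concrete, below is a minimal sketch of one dLLM denoising-sampling step, assuming a simplified confidence-based remasking scheme (the `MASK_ID` constant, function name, and threshold-free top-n selection are illustrative, not taken from the paper). It exercises exactly the operations the abstract highlights: a softmax reduction over vocabulary-wide logits, reduction-based token selection, and an in-place masked update.

```python
# Hypothetical sketch of one dLLM sampling step (confidence-based remasking).
# MASK_ID, sample_step, and n_unmask are illustrative names, not the paper's API.
import numpy as np

MASK_ID = -1  # hypothetical mask-token id


def sample_step(logits, tokens, n_unmask):
    """One denoising step: unmask the n_unmask most confident masked positions.

    logits : (seq_len, vocab) float array of model outputs
    tokens : (seq_len,) int array; masked positions hold MASK_ID
    """
    # Softmax over vocabulary-wide logits: the memory-heavy load/store the
    # abstract attributes most of the sampling latency to.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Reduction-based token selection: per-position argmax and confidence.
    pred = probs.argmax(axis=-1)
    conf = probs.max(axis=-1)

    # Only still-masked positions compete for unmasking.
    masked = tokens == MASK_ID
    conf = np.where(masked, conf, -np.inf)

    # Iterative masked update: commit the top-n_unmask tokens in place
    # (the in-place reuse pattern the proposed NPU design targets).
    top = np.argsort(conf)[-n_unmask:]
    tokens[top] = pred[top]
    return tokens
```

Running this step repeatedly until no `MASK_ID` remains mimics the iterative denoising loop; on a GEMM-centric accelerator, the softmax, argmax, and scatter-style write above map poorly onto systolic compute, which is the mismatch the paper's vector primitives address.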