Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling

📅 2026-01-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiency of sampling in diffusion-based large language models (dLLMs) on conventional GEMM-centric NPUs, where high memory overhead and irregular memory access patterns lead to sampling latency accounting for up to 70% of total inference delay. The study systematically identifies, for the first time, the essential non-GEMM instruction set required for dLLM sampling and proposes a sampling-oriented NPU microarchitecture that departs from the GEMM-centric paradigm. By integrating lightweight vector primitives, in-place memory reuse, and a decoupled mixed-precision memory hierarchy, the design significantly improves sampling efficiency. Evaluated under equivalent process technology, the proposed architecture achieves up to 2.53× speedup over an NVIDIA RTX A6000 GPU. To ensure reproducibility and functional correctness, the authors open-source a cycle-accurate simulator and RTL implementation.

📝 Abstract
Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase displays fundamentally different characteristics compared to GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total model inference latency, primarily due to substantial memory loads and writes from vocabulary-wide logits, reduction-based token selection, and iterative masked updates. These processes demand large on-chip SRAM and involve irregular memory accesses that conventional NPUs struggle to handle efficiently. To address this, we identify a set of critical instructions that an NPU architecture must specifically optimize for dLLM sampling. Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy. Together, these optimizations deliver up to a 2.53× speedup over an NVIDIA RTX A6000 GPU at an equivalent process node. We also open-source our cycle-accurate simulation and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations.
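The three sampling operations the abstract names (vocabulary-wide softmax/argmax reduction, confidence-based token selection, and the in-place masked update) can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, the `unmask_frac` schedule, and the confidence-ranking heuristic are illustrative assumptions, written in plain Python to keep the non-GEMM memory pattern visible:

```python
import math

def dllm_sampling_step(logits, tokens, mask, unmask_frac=0.5):
    """One denoising step of a diffusion-LLM sampler (illustrative sketch).

    logits: per-position score lists, shape (seq_len, vocab)
    tokens: current token ids, updated in place
    mask:   bools, True where a position is still masked
    """
    seq_len = len(logits)
    confidence, candidates = [], []
    for pos in range(seq_len):
        # Reduction over the vocabulary-wide logits: softmax + argmax.
        m = max(logits[pos])
        exps = [math.exp(x - m) for x in logits[pos]]
        z = sum(exps)
        best = max(range(len(exps)), key=lambda v: exps[v])
        # Already-unmasked positions are ineligible this step.
        confidence.append(exps[best] / z if mask[pos] else -1.0)
        candidates.append(best)

    # Unmask the most confident fraction of the remaining masked positions.
    num_masked = sum(mask)
    k = max(1, int(num_masked * unmask_frac)) if num_masked else 0
    ranked = sorted(range(seq_len), key=lambda p: confidence[p], reverse=True)
    for pos in ranked[:k]:
        # Iterative masked update, performed in place on the token buffer.
        tokens[pos] = candidates[pos]
        mask[pos] = False
    return tokens, mask
```

Note how every step is a vector reduction, a top-k selection, or a scattered in-place write over a vocabulary-sized buffer — none of it maps onto a GEMM, which is the mismatch the proposed NPU microarchitecture targets.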
Problem

Research questions and friction points this paper is trying to address.

Diffusion LLM
sampling latency
memory access
non-GEMM operations
NPU inefficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion LLM
non-GEMM acceleration
in-place memory reuse
mixed-precision memory hierarchy
NPU architecture
Authors

Binglei Lou (Imperial College London)
Haoran Wu (University of Cambridge, UK)
Yao Lai (HKU | UT Austin)
Jiayi Nie (University of Cambridge, UK)
Can Xiao (Imperial College London)
Xuan Guo (Imperial College London)
Rika Antonova (University of Cambridge, UK)
Robert Mullins (Department of Computer Science and Technology, University of Cambridge)
Aaron Zhao (Imperial College London)