Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling

📅 2026-01-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiency of sampling in diffusion-based large language models (dLLMs) on conventional GEMM-centric NPUs, where high memory overhead and irregular memory access patterns lead to sampling latency accounting for up to 70% of total inference delay. The study systematically identifies, for the first time, the essential non-GEMM instruction set required for dLLM sampling and proposes a sampling-oriented NPU microarchitecture that departs from the GEMM-centric paradigm. By integrating lightweight vector primitives, in-place memory reuse, and a decoupled mixed-precision memory hierarchy, the design significantly improves sampling efficiency. Evaluated under equivalent process technology, the proposed architecture achieves up to 2.53× speedup over an NVIDIA RTX A6000 GPU. To ensure reproducibility and functional correctness, the authors open-source a cycle-accurate simulator and RTL implementation.

📝 Abstract
Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase displays fundamentally different characteristics compared to GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total model inference latency, primarily due to substantial memory loads and writes from vocabulary-wide logits, reduction-based token selection, and iterative masked updates. These processes demand large on-chip SRAM and involve irregular memory accesses that conventional NPUs struggle to handle efficiently. To address this, we identify a set of critical instructions that an NPU architecture must specifically optimize for dLLM sampling. Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy. Together, these optimizations deliver up to a 2.53× speedup over an NVIDIA RTX A6000 GPU at an equivalent process node. We also open-source our cycle-accurate simulation and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations.
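The three sampling operations the abstract names (vocabulary-wide softmax/argmax reduction, confidence-based token selection, and the in-place masked update) can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, the `unmask_frac` schedule, and the confidence-ranking heuristic are illustrative assumptions, written in plain Python to keep the non-GEMM memory pattern visible:

```python
import math

def dllm_sampling_step(logits, tokens, mask, unmask_frac=0.5):
    """One denoising step of a diffusion-LLM sampler (illustrative sketch).

    logits: per-position score lists, shape (seq_len, vocab)
    tokens: current token ids, updated in place
    mask:   bools, True where a position is still masked
    """
    seq_len = len(logits)
    confidence, candidates = [], []
    for pos in range(seq_len):
        # Reduction over the vocabulary-wide logits: softmax + argmax.
        m = max(logits[pos])
        exps = [math.exp(x - m) for x in logits[pos]]
        z = sum(exps)
        best = max(range(len(exps)), key=lambda v: exps[v])
        # Already-unmasked positions are ineligible this step.
        confidence.append(exps[best] / z if mask[pos] else -1.0)
        candidates.append(best)

    # Unmask the most confident fraction of the remaining masked positions.
    num_masked = sum(mask)
    k = max(1, int(num_masked * unmask_frac)) if num_masked else 0
    ranked = sorted(range(seq_len), key=lambda p: confidence[p], reverse=True)
    for pos in ranked[:k]:
        # Iterative masked update, performed in place on the token buffer.
        tokens[pos] = candidates[pos]
        mask[pos] = False
    return tokens, mask
```

Note how every step is a vector reduction, a top-k selection, or a scattered in-place write over a vocabulary-sized buffer — none of it maps onto a GEMM, which is the mismatch the proposed NPU microarchitecture targets.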
Problem

Research questions and friction points this paper is trying to address.

Diffusion LLM
sampling latency
memory access
non-GEMM operations
NPU inefficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion LLM
non-GEMM acceleration
in-place memory reuse
mixed-precision memory hierarchy
NPU architecture
Authors

Binglei Lou (Imperial College London)
Haoran Wu (University of Cambridge, UK)
Yao Lai (HKU | UT Austin)
Jiayi Nie (University of Cambridge, UK)
Can Xiao (Imperial College London)
Xuan Guo (Imperial College London)
Rika Antonova (University of Cambridge, UK)
Robert Mullins (Department of Computer Science and Technology, University of Cambridge)
Aaron Zhao (Imperial College London)