DECA: A Near-Core LLM Decompression Accelerator Supporting Out-of-Order Invocation

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
In LLM inference, weight quantization and sparsity introduce decompression (dequantization/de-sparsification) as a critical bottleneck on HBM bandwidth; existing software implementations are inefficient, and overall GeMM performance is hindered by misalignment among memory resources, vector units, and matrix engines. This paper proposes DECA, a near-core decompression accelerator that dynamically decompresses weight blocks adjacent to the compute cores, delivering ready-to-use tiles directly to in-core GeMM engines. The paper introduces an analytical performance model with a 3D visual representation to characterize memory–vector–matrix engine interactions, and designs a custom ISA extension supporting out-of-order decompression invocation to enable deep overlap of decompression and computation. Evaluated on a simulated 56-core Xeon server with HBM, DECA accelerates compressed GeMMs by up to 4× over optimized Intel software kernels and reduces next-token generation time by 1.6–2.6× for Llama2-70B and OPT-66B.

📝 Abstract
To alleviate the memory bandwidth bottleneck in Large Language Model (LLM) inference workloads, weight matrices are stored in memory in quantized and sparsified formats. Hence, before tiles of these matrices can be processed by in-core generalized matrix multiplication (GeMM) hardware engines, they need to be dequantized and de-sparsified. This is currently performed in software with vector operations. Unfortunately, this approach delivers only modest performance. Moreover, it is hard to understand how to improve the system, as the overall GeMM performance depends on the interaction between memory resources, vector units, and hardware matrix engines. To improve the performance of LLM inference on advanced platforms equipped with in-core GeMM engines and HBM, this paper makes three main contributions. First, it develops an analytical performance model with a 3D visual representation that provides insights into how memory resources, vector units, and hardware matrix engines interact to deliver compressed GeMM performance. Second, it proposes DECA, a new near-core ML-model decompression accelerator. DECA offloads tile de-sparsification and dequantization from the CPU, producing ready-to-use tiles for in-core GeMM engines. Third, it introduces a new ISA extension that enables out-of-order invocation of the near-core accelerator. With this extension, accelerator and core computations can interleave and overlap with high performance. Our evaluation shows that, in a simulated 56-core Xeon 4 server with HBM, DECA accelerates the execution of compressed GeMMs by up to 4x over the use of optimized Intel software kernels. Further, DECA reduces the next-token generation time of Llama2-70B and OPT-66B by 1.6x-2.6x.
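To make the software baseline concrete, the sketch below shows what tile decompression involves before a GeMM engine can consume the data. The format here is an assumption for illustration (4-bit symmetric quantization with one scale per tile, plus a bitmask marking nonzero positions); the paper's actual compressed formats and kernels may differ.

```python
def decompress_tile(packed_vals, bitmask, scale, tile_len):
    """Expand `packed_vals` (quantized nonzeros, in storage order) into a
    dense tile of `tile_len` floats, guided by `bitmask` (1 bit/element).

    This is the work DECA offloads from the CPU's vector units:
    de-sparsification (scattering nonzeros) + dequantization (int -> float).
    """
    dense = [0.0] * tile_len
    src = 0
    for i in range(tile_len):
        if (bitmask >> i) & 1:      # de-sparsification: place a nonzero here
            q = packed_vals[src]    # assumed signed 4-bit value in [-8, 7]
            src += 1
            dense[i] = q * scale    # dequantization with a per-tile scale
    return dense

# Tiny example: an 8-element tile with 3 stored nonzeros.
tile = decompress_tile([3, -2, 5], bitmask=0b10010010, scale=0.5, tile_len=8)
```

Done in software, this scatter-and-scale loop competes with the GeMM itself for vector-unit cycles, which is the inefficiency the accelerator targets.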
Problem

Research questions and friction points this paper is trying to address.

Alleviating the memory bandwidth bottleneck in LLM inference
Improving the performance of dequantization and de-sparsification
Optimizing the interaction between memory resources, vector units, and matrix engines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analytical model for compressed GeMM performance
Near-core decompression accelerator DECA
ISA extension for out-of-order invocation