🤖 AI Summary
Memory-bound token-by-token decoding severely limits large language model (LLM) inference efficiency on CPU platforms. Method: This paper proposes a co-optimization framework leveraging Intel Advanced Matrix Extensions (AMX) and unstructured sparsity—introducing unstructured sparsity into attention computation for the first time, designing hardware-aware sparse GEMM and sparse attention kernels, and building an open-source sparse kernel library that enables end-to-end, automatic replacement of PyTorch linear layers. Contribution/Results: Evaluated against native PyTorch, our approach achieves 1.42× end-to-end decoding latency reduction and 1.14× acceleration in attention computation, with zero accuracy loss. The solution is fully compatible with standard PyTorch workflows and requires no model retraining. By eliminating memory bottlenecks without sacrificing precision, this work delivers a practical, system-level optimization for energy-efficient and cost-effective LLM deployment on commodity CPUs.
📝 Abstract
Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with CPUs enables broader AI access at lower cost and power consumption. This acceleration potential is especially relevant during the memory-bound decoding stage of LLM inference, which processes one token at a time and is increasingly exercised by reasoning models. We utilize Advanced Matrix Extensions (AMX) support on the latest Intel CPUs together with unstructured sparsity to achieve a $1.42\times$ reduction in end-to-end latency compared to the current PyTorch implementation by applying our technique to linear layers. We provide a set of open-source customized sparse kernels that can speed up any PyTorch model by automatically replacing all linear layers with our custom sparse implementation. Furthermore, we demonstrate for the first time the use of unstructured sparsity in the attention computation, achieving a $1.14\times$ speedup over current systems without compromising accuracy. Code: https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SparAMX
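The core idea behind the sparse kernels can be sketched in plain NumPy: decoding is essentially a GEMV (one token's activation vector against each weight matrix), so skipping zero weights directly cuts the memory traffic that dominates decode latency. This is a minimal illustrative sketch only; the function names are hypothetical, and the actual AMX tile kernels live in the linked repository.

```python
import numpy as np

def pack_unstructured(W):
    # Precompute, per output row, the indices and values of nonzero weights.
    # Unstructured sparsity imposes no pattern on where zeros fall.
    return [(idx := np.nonzero(row)[0], row[idx]) for row in W]

def sparse_gemv(packed, x):
    # Multiply only the stored nonzeros against the matching activations,
    # so memory reads scale with the nonzero count, not the full matrix.
    return np.array([vals @ x[idx] for idx, vals in packed])

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
W[rng.random(W.shape) < 0.5] = 0.0   # ~50% unstructured sparsity
x = rng.standard_normal(16)

packed = pack_unstructured(W)
assert np.allclose(sparse_gemv(packed, x), W @ x)  # matches the dense result
```

Because only nonzero weights are ever touched, the result is bit-identical to the dense computation, which is why this style of optimization incurs no accuracy loss.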