T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

📅 2024-06-25

🏛️ arXiv.org

📈 Citations: 6

✨ Influential: 1

🤖 AI Summary

Existing CPU systems lack native support for mixed-precision GEMM (mpGEMM), so inference of quantized large language models (LLMs) on edge devices requires costly dequantization of low-bit weights. To address this, we propose T-MAC, a compute paradigm that uses bit-wise lookup tables (LUTs) in place of conventional multiply-accumulate operations, enabling direct mpGEMM between low-bit weights and high-precision activations without dequantization. The LUT-based kernels scale linearly with weight bit-width, eliminate multiplications, reduce additions, and give a unified, scalable solution across bit-widths. For BitNet-b1.58-3B, T-MAC reaches 30 tokens/s on a single core and 71 tokens/s on eight cores of an M2-Ultra, and 11 tokens/s on a Raspberry Pi 5. Across low-bit Llama and BitNet models, it delivers up to 4× higher throughput and a 70% reduction in energy consumption compared with llama.cpp.

📝 Abstract

The deployment of Large Language Models (LLMs) on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs require mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing weights for high-precision computation. This indirect approach can incur significant inference overhead. In this paper, we introduce T-MAC, an innovative lookup table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the number of additions required. Specifically, T-MAC transforms the traditional data-type-centric multiplication into bit-wise table lookup, and enables a unified and scalable mpGEMM solution. Our LUT-based kernels scale linearly with the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on M2-Ultra, and 11 tokens/s on lower-end devices like Raspberry Pi 5, which significantly exceeds the average adult reading speed. With its LUT-based computing paradigm, T-MAC paves the way for the practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency. The system is open-sourced at https://github.com/microsoft/T-MAC.

Problem

Research questions and friction points this paper is trying to address.

Low-bit LLM inference on CPUs requires mpGEMM, which existing systems do not support natively

Weights must be dequantized before high-precision computation

Dequantization-based inference incurs significant computation overhead and energy consumption

Innovation

Methods, ideas, or system contributions that make the work stand out.

LUT-based mpGEMM without dequantization overhead

Bit-wise table lookup replaces multiplications

Scalable kernel design for variable bit-widths
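
To make the bit-wise lookup idea concrete, here is a minimal pure-Python sketch of a LUT-based matrix-vector product. It is an illustration under stated assumptions, not the T-MAC kernel itself: the function names (build_tables, lut_matvec, reference_matvec), the group size g = 4, and the use of unsigned integer weights with no scales or zero-points are all choices made for this example. Each weight is split into one-bit planes; for every group of g activations, a table holds the sum for each of the 2^g possible bit patterns, so a dot product over one bit plane becomes a series of table lookups and additions. The outer loop runs once per weight bit, which is where the linear scaling with bit-width comes from.

```python
def build_tables(acts, g):
    """For each group of g activations, precompute the partial sum
    for every possible g-bit weight pattern (2**g entries per group)."""
    tables = []
    for start in range(0, len(acts), g):
        group = acts[start:start + g]
        table = [0.0] * (1 << g)
        for pattern in range(1, 1 << g):
            low = pattern & -pattern        # lowest set bit of the pattern
            j = low.bit_length() - 1        # index of that bit within the group
            # Reuse the entry for the pattern without this bit: one addition each.
            table[pattern] = table[pattern ^ low] + group[j]
        tables.append(table)
    return tables


def lut_matvec(weights, acts, bits, g=4):
    """Multiply a matrix of unsigned `bits`-bit integer weights by a vector
    of activations using table lookups instead of per-element multiplies.
    Assumes len(acts) is a multiple of g (an assumption of this sketch)."""
    assert len(acts) % g == 0
    tables = build_tables(acts, g)          # built once, shared by all rows
    out = []
    for row in weights:
        acc = 0.0
        for b in range(bits):               # one pass per weight bit plane
            plane_sum = 0.0
            for k, table in enumerate(tables):
                idx = 0
                for j in range(g):          # pack g one-bit weights into an index
                    idx |= ((row[k * g + j] >> b) & 1) << j
                plane_sum += table[idx]     # lookup replaces g multiply-adds
            # Scale the plane by 2**b; in fixed-point arithmetic this is a shift.
            acc += plane_sum * (1 << b)
        out.append(acc)
    return out


def reference_matvec(weights, acts):
    """Conventional multiply-accumulate, used only to check the sketch."""
    return [sum(w * a for w, a in zip(row, acts)) for row in weights]
```

A quick check with 2-bit weights: calling lut_matvec(weights, acts, bits=2) on a small matrix should agree with reference_matvec(weights, acts). In a real kernel the bit-packed indices would be stored ahead of time rather than rebuilt per call, and the tables would be sized to fit the CPU's in-register lookup instructions; both details are outside what this sketch attempts.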

🔎 Similar Papers

EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge