Fast Matrix Multiplications for Lookup Table-Quantized LLMs

πŸ“… 2024-07-15
πŸ›οΈ Conference on Empirical Methods in Natural Language Processing
πŸ“ˆ Citations: 8
✨ Influential: 1
πŸ€– AI Summary
Large language model (LLM) inference is bottlenecked by GPU memory bandwidth, particularly under non-uniform low-bit (e.g., 3-bit) lookup table (LUT) quantization, where fused dequantization and matrix multiplication suffer from poor computational efficiency. To address this, we propose FLUTE, an inference engine featuring the first CUDA kernel to support LUT quantization at non-evenly-divisible bit widths. FLUTE reduces bit-manipulation overhead via offline restructuring of the quantized weights and alleviates shared-memory bandwidth pressure through vectorization and duplication of the lookup table. It natively supports weight-only quantization, NormalFloat extensions, and bandwidth-aware scheduling. Experiments show that, at batch sizes below 32 and a quantization group size of 128, the FLUTE kernel achieves a 2–4Γ— speedup over state-of-the-art GEMM kernels. Applied to quantize LLaMA-3, FLUTE delivers a 1.5–2Γ— end-to-end throughput improvement while remaining competitive in accuracy with strong baselines.

πŸ“ Abstract
The deployment of large language models (LLMs) is often constrained by memory bandwidth, where the primary bottleneck is the cost of transferring model parameters from the GPU's global memory to its registers. When coupled with custom kernels that fuse the dequantization and matmul operations, weight-only quantization can thus enable faster inference by reducing the amount of memory movement. However, developing high-performance kernels for weight-quantized LLMs presents substantial challenges, especially when the weights are compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform, lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup table engine for LUT-quantized LLMs, which uses offline restructuring of the quantized weight matrix to minimize bit manipulations associated with unpacking, and vectorization and duplication of the lookup table to mitigate shared memory bandwidth constraints. At batch sizes < 32 and a quantization group size of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster than existing GEMM kernels. As an application of FLUTE, we explore a simple extension to lookup table-based NormalFloat quantization and apply it to quantize LLaMA3 to various configurations, obtaining competitive quantization performance against strong baselines while obtaining an end-to-end throughput increase of 1.5 to 2 times.
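To make the core idea concrete, here is a minimal NumPy sketch of LUT-based weight-only dequantization followed by a matmul. This is not the FLUTE CUDA kernel (which fuses these steps and operates on packed 3-bit storage); it only illustrates the dequantize-then-multiply pattern with an 8-entry (3-bit) lookup table and per-group scales at group size 128. All shapes, names, and the LUT values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: W is (out_features, in_features), quantized to
# 3-bit codes (8 LUT entries) with one scale per group of 128 columns.
out_features, in_features, group_size = 64, 256, 128
n_groups = in_features // group_size

# 8-entry lookup table standing in for NormalFloat-style levels.
lut = np.linspace(-1.0, 1.0, 8).astype(np.float32)

# Quantized representation: a 3-bit code per weight, plus group scales.
codes = rng.integers(0, 8, size=(out_features, in_features))
scales = rng.uniform(0.5, 1.5, size=(out_features, n_groups)).astype(np.float32)

def dequantize(codes, lut, scales, group_size):
    """Map codes through the LUT, then apply per-group scales."""
    w = lut[codes]                              # table lookup per element
    w = w.reshape(w.shape[0], -1, group_size)   # split columns into groups
    w *= scales[:, :, None]                     # broadcast scale over group
    return w.reshape(codes.shape)

x = rng.standard_normal(in_features).astype(np.float32)
y = dequantize(codes, lut, scales, group_size) @ x  # fused in FLUTE; separate here
print(y.shape)  # (64,)
```

On a GPU, the lookup table lives in shared memory and every thread indexes it, which is what motivates the paper's vectorized and duplicated LUT layout; in NumPy the same lookup is just fancy indexing.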
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Low-bit Quantization
Efficient Matrix Computations
Innovation

Methods, ideas, or system contributions that make the work stand out.

FLUTE
Low-bit Quantization
Speed-up Optimization
Han Guo
Massachusetts Institute of Technology
William Brandon
Massachusetts Institute of Technology
Radostin Cholakov
High School of Mathematics Plovdiv
Jonathan Ragan-Kelley
MIT CSAIL
Computer Graphics Β· Compilers Β· Computer Architecture Β· Programming Languages Β· Image Processing
Eric P. Xing
Carnegie Mellon University, MBZUAI, Petuum Inc.
Yoon Kim
Associate Professor, MIT
Machine Learning Β· Natural Language Processing Β· Deep Learning