Scalable MatMul-free Language Modeling

πŸ“… 2024-06-04
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 22
✨ Influential: 3
πŸ€– AI Summary
Large language models (LLMs) incur heavy computational and memory costs, much of which comes from matrix multiplication (MatMul). This work presents a fully MatMul-free language model architecture that replaces dense MatMul operations with additions and element-wise products. At 2.7B parameters, the MatMul-free model achieves language modeling performance on par with state-of-the-art Transformers, and the scaling analysis indicates that the gap to full-precision Transformers narrows as model size grows. For efficient deployment, the authors provide GPU-optimized kernels and a custom FPGA accelerator. Experiments show more than a 10x reduction in inference memory, 61% lower training memory than an unoptimized baseline, and FPGA operation at only 13 W while exceeding human-readable throughput, pointing toward low-power, highly scalable LLMs.

πŸ“ Abstract
Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at https://github.com/ridgerchu/matmulfreellm.
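To make the core idea concrete, below is a minimal PyTorch sketch of how a dense layer can become "MatMul-free" in the sense the abstract describes: if weights are constrained to {-1, 0, +1}, the usual multiply-accumulate reduces to signed additions. The class name TernaryLinear, the scaling rule, and the straight-through-estimator details are illustrative assumptions, not the authors' exact implementation (see the linked repository for that).

```python
# Minimal sketch (not the authors' code): a linear layer whose weights are
# quantized to {-1, 0, +1}, so the "MatMul" is really signed accumulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Module):
    """Linear layer with weights quantized to {-1, 0, +1} on the fly."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Scale by the mean absolute weight, then round to the nearest ternary value.
        scale = w.abs().mean().clamp(min=1e-5)
        w_ternary = torch.clamp(torch.round(w / scale), -1, 1)
        # Straight-through estimator: the forward pass uses ternary weights,
        # gradients flow back through the full-precision weights.
        w_q = w + (w_ternary * scale - w).detach()
        # F.linear is used here only for clarity on a GPU; with ternary
        # weights, dedicated hardware needs no true multiplications.
        return F.linear(x, w_q)

# Usage: drop-in replacement for nn.Linear inside a channel mixer.
layer = TernaryLinear(512, 2048)
y = layer(torch.randn(4, 128, 512))
print(y.shape)  # torch.Size([4, 128, 2048])
```

On a GPU this still dispatches a dense kernel; the paper's point is that custom hardware such as the FPGA accelerator can exploit the absence of multiplications directly.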
Problem

Research questions and friction points this paper is trying to address.

Eliminate MatMul in LLMs to reduce computation and memory
Maintain performance in billion-parameter models without MatMul
Achieve energy-efficient, high-throughput processing for lightweight LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully MatMul-free architecture that matches Transformer performance up to at least 2.7B parameters (a token-mixer sketch follows this list)
GPU-efficient implementation that cuts training memory by up to 61%
Custom FPGA accelerator that runs billion-parameter models at 13 W
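As a rough illustration of how token mixing can also avoid MatMul, here is a GRU-style mixer that uses only element-wise gates and per-token dense projections; there is no hidden-to-hidden matrix product. The structure and names (MatMulFreeTokenMixer, the f/c/g gates) are assumptions for illustration, not the paper's exact formulation; in the actual model the dense projections would themselves be ternary layers like the TernaryLinear sketch after the abstract, while nn.Linear is used here only so the snippet runs standalone.

```python
# Illustrative sketch of a MatMul-free, GRU-style token mixer: all token
# interactions are element-wise, and dense projections act per token.
import torch
import torch.nn as nn

class MatMulFreeTokenMixer(nn.Module):
    """GRU-style token mixer with element-wise recurrence only.

    dense_cls defaults to nn.Linear so the sketch runs standalone; a
    MatMul-free model would use a ternary layer in its place.
    """

    def __init__(self, dim: int, dense_cls=nn.Linear):
        super().__init__()
        self.proj_f = dense_cls(dim, dim)  # forget gate
        self.proj_c = dense_cls(dim, dim)  # candidate state
        self.proj_g = dense_cls(dim, dim)  # output gate
        self.proj_o = dense_cls(dim, dim)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); note there is no hidden-to-hidden MatMul.
        batch, seq_len, dim = x.shape
        h = x.new_zeros(batch, dim)
        outputs = []
        for t in range(seq_len):
            xt = x[:, t]
            f = torch.sigmoid(self.proj_f(xt))   # element-wise forget gate
            c = torch.tanh(self.proj_c(xt))      # candidate state
            h = f * h + (1.0 - f) * c            # element-wise recurrence
            g = torch.sigmoid(self.proj_g(xt))
            outputs.append(self.proj_o(g * h))
        return torch.stack(outputs, dim=1)

# Usage
mixer = MatMulFreeTokenMixer(256)
out = mixer(torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```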
Rui-Jie Zhu
Ph.D. Student, University of California, Santa Cruz
Brain-Inspired Engineering, Language Modeling
Yu Zhang
Soochow University
Ethan Sifferman
University of California, Santa Cruz
Tyler Sheaves
University of California, Davis
Yiqiao Wang
LuxiTech
Dustin Richmond
University of California, Santa Cruz
Peng Zhou
University of California, Santa Cruz and LuxiTech
Jason K. Eshraghian
Assistant Professor, University of California, Santa Cruz
lightweight machine learning, neuromorphic computing, spiking neural networks