Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

📅 2025-02-17
🤖 AI Summary
This work addresses the lack of efficient inference support for ternary large language models (e.g., BitNet b1.58) on edge devices. We present the first lossless sub-2-bit-per-weight real-time inference system tailored for edge deployment. Our method introduces: (1) a novel hybrid-precision GEMM scheme combining ternary lookup tables (TL) and scaled Int2 (I2_S), balancing numerical fidelity and memory efficiency; (2) a general-purpose, low-bit element-wise lookup table (ELUT) technique enabling arbitrary low-bit nonlinear mappings; and (3) hardware-aware C++ optimizations targeting edge accelerators. Experiments show up to 6.25× speedup over full-precision baselines and up to 2.32× improvement over state-of-the-art low-bit approaches—while incurring zero accuracy degradation. The implementation is open-sourced and supports plug-and-play deployment.
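The I2_S ("Int2 with a Scale") format mentioned above stores ternary weights in 2-bit codes plus a scale, so dequantization reconstructs the original values exactly. A minimal Python sketch of the idea (the per-tensor scale choice, code mapping, and byte-packing order are illustrative assumptions, not the paper's actual kernel layout):

```python
import numpy as np

def quantize_i2s(w):
    """Encode ternary weights as 2-bit codes ({-1,0,1} -> {0,1,2}) plus one scale."""
    scale = float(np.abs(w).max()) or 1.0        # per-tensor scale (illustrative)
    tern = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    codes = (tern + 1).astype(np.uint8)
    # pack four 2-bit codes per byte (length assumed to be a multiple of 4)
    packed = (codes[0::4] | (codes[1::4] << 2)
              | (codes[2::4] << 4) | (codes[3::4] << 6)).astype(np.uint8)
    return packed, scale

def dequantize_i2s(packed, scale, n):
    """Exact inverse: unpack the 2-bit codes and rescale."""
    codes = np.empty(n, dtype=np.uint8)
    for i in range(4):
        codes[i::4] = (packed >> (2 * i)) & 0b11
    return (codes.astype(np.int8) - 1) * scale

w = np.array([1.0, -1.0, 0.0, 1.0, 0.0, 0.0, -1.0, 1.0]) * 0.7
packed, s = quantize_i2s(w)
assert np.allclose(dequantize_i2s(packed, s, w.size), w)   # round-trip is exact
```

Because the ternary values round-trip exactly through the 2-bit codes and the scale, inference with this representation is lossless by construction, which is the property the paper claims for I2_S.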

📝 Abstract
The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper, offering a sophisticated solution for the efficient and practical deployment of edge LLMs.
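The ELUT extension mentioned in the abstract generalizes the lookup idea from matrix multiplication to element-wise maps: for a b-bit quantized tensor, any nonlinear function can be precomputed once over all 2^b codes and then applied by pure table indexing. A hedged sketch (the 4-bit width, the symmetric dequantizer, and tanh as the example map are assumptions for illustration, not the paper's configuration):

```python
import numpy as np

def build_elut(f, dequant, bits=4):
    """Precompute f over every b-bit code; applying f is then pure indexing."""
    codes = np.arange(2 ** bits, dtype=np.uint8)
    return f(dequant(codes))                        # table with 2**bits entries

# Hypothetical symmetric int4 dequantizer, with tanh standing in for any
# nonlinear element-wise map.
dequant = lambda c: (c.astype(np.float32) - 8) * 0.25
table = build_elut(np.tanh, dequant, bits=4)

q = np.array([0, 3, 8, 15], dtype=np.uint8)         # a 4-bit quantized tensor
out = table[q]                                      # element-wise f via lookup
assert np.allclose(out, np.tanh(dequant(q)))
```

The appeal is that the table is tiny (2^b entries) and independent of tensor size, so the cost of evaluating an arbitrary nonlinearity is amortized into a one-time precomputation.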
Problem

Research questions and friction points this paper is trying to address.

Efficient edge inference for ternary LLMs
Optimized mpGEMM for sub-2-bits-per-weight inference
High-speed, lossless edge inference solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimized inference system for BitNet b1.58 and ternary LLMs
Novel mpGEMM library for sub-2-bits-per-weight efficiency
Ternary Lookup Table (TL) for spatial efficiency; Int2 with a Scale (I2_S) for lossless inference
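The Ternary Lookup Table idea listed above can be sketched as follows: for each group of g activations, precompute the partial dot products against all 3^g possible ternary weight patterns, so the mpGEMM inner loop reduces to table lookups indexed by base-3 weight codes. A minimal NumPy sketch (group size g=2, the base-3 encoding, and plain Python indexing are illustrative assumptions, not the paper's optimized kernel):

```python
import itertools
import numpy as np

G = 2                                               # activations per lookup group
PATTERNS = np.array(list(itertools.product((-1, 0, 1), repeat=G)))  # 3**G rows

def encode_groups(w_tern):
    """Base-3-encode ternary weights, G per code (row index into PATTERNS)."""
    trits = w_tern.reshape(-1, G) + 1               # {-1,0,1} -> {0,1,2}
    return trits @ (3 ** np.arange(G - 1, -1, -1))

def tl_dot(w_codes, x):
    """Dot product via lookups: one precomputed table column per group of x."""
    lut = PATTERNS @ x.reshape(-1, G).T             # (3**G, n_groups) partial sums
    return sum(lut[c, g] for g, c in enumerate(w_codes))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.integers(-1, 2, size=8)                     # ternary weights
assert np.isclose(tl_dot(encode_groups(w), x), w @ x)
```

In a real kernel the table built from the activations is reused across many weight rows, which is where the speedup over bit-wise methods comes from; and since a trit carries log2(3) ≈ 1.58 bits of information, base-3 packing is what makes sub-2-bits-per-weight storage possible in principle.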
👥 Authors

Jinheng Wang (Peking University, Microsoft Research)
Hansong Zhou (Peking University, Microsoft Research)
Ting Song (Microsoft Research)
Shijie Cao (Microsoft Research Asia) · Efficient Deep Learning, Deep Learning Systems, Computer Architecture
Yan Xia (Microsoft Research)
Ting Cao (Microsoft Research)
Jianyu Wei (USTC & MSRA Joint PhD) · LLM Infra, Inference Systems, Quantization, Kernels, Co-design
Shuming Ma (Microsoft Research Asia) · Natural Language Processing, Deep Learning
Hongyu Wang (University of Chinese Academy of Sciences, Microsoft Research)
Furu Wei (Distinguished Scientist, Microsoft Research) · Natural Language Processing, Artificial Intelligence, General AI, Generative AI, Multimodal AI