any4: Learned 4-bit Numeric Representation for LLMs

📅 2025-07-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address low accuracy and reliance on preprocessing and calibration data in 4-bit weight quantization of large language models (LLMs), this paper proposes *any4*—an end-to-end learnable 4-bit numerical representation scheme. *any4* eliminates the need for weight or activation preprocessing, supports arbitrary numeric encodings, and enables efficient calibration with a single curated sample. Its core innovations are a learnable quantization framework and a GPU-friendly lookup-table-based matrix multiplication, integrated into the open-source library *tinygemm*. Evaluated across diverse architectures—including Llama 2/3, Mistral, and Mixtral—*any4* consistently outperforms other 4-bit numeric representations (INT4, FP4, NF4), is competitive with preprocessing-based methods such as AWQ and GPTQ, and remains competitive at lower bit widths via any3 and any2 variants, achieving higher model accuracy without sacrificing inference efficiency. This work establishes a general, low-overhead paradigm for ultra-low-bit LLM deployment.
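The learned representation can be pictured as fitting a small per-row codebook of 16 values that the 4-bit codes index into. The sketch below uses a simple k-means fit as an illustrative stand-in for any4's end-to-end learned codebook (the actual method optimizes the representation against calibration data; `fit_any4_row` and its initialization are assumptions, not the paper's algorithm):

```python
import numpy as np

def fit_any4_row(weights, n_iters=20):
    """Fit a 16-entry codebook to one weight row via k-means
    (illustrative stand-in for any4's learned 4-bit representation)."""
    # Initialize the 16 codebook values from evenly spaced quantiles of the row.
    codebook = np.quantile(weights, np.linspace(0.0, 1.0, 16))
    for _ in range(n_iters):
        # Assign each weight to its nearest codebook entry (a 4-bit index).
        idx = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each entry to the mean of the weights assigned to it.
        for k in range(16):
            if np.any(idx == k):
                codebook[k] = weights[idx == k].mean()
    return codebook, idx.astype(np.uint8)

row = np.random.randn(4096).astype(np.float32)
codebook, codes = fit_any4_row(row)
dequant = codebook[codes]                 # reconstruct weights from 4-bit codes
err = np.abs(row - dequant).mean()        # small reconstruction error
```

Because the 16 values are arbitrary floats rather than a fixed grid (as in int4/fp4/nf4), the codebook can adapt to each row's weight distribution.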

📝 Abstract
We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4 and nf4, as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bits. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset as done in most quantization approaches. We also open source tinygemm, a latency optimized GPU matrix multiplication library for LLMs, that implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods. We open source our code at https://github.com/facebookresearch/any4 .
Problem

Research questions and friction points this paper is trying to address.

Learned 4-bit quantization for LLMs without preprocessing
Improves accuracy over int4, fp4, and nf4 across model sizes and families
Competes with AWQ and GPTQ despite requiring no weight or activation preprocessing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learned 4-bit weight quantization for LLMs
No pre-processing of weights or activations
GPU-efficient lookup table implementation
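The lookup-table idea from the list above can be sketched in a few lines: 4-bit codes are packed two per byte, unpacked at matmul time, and dequantized by indexing a 16-entry table before the multiply. This is a numpy illustration only; tinygemm fuses these steps into latency-optimized CUDA kernels, and the function and parameter names here are hypothetical:

```python
import numpy as np

def lut_matmul(x, packed_codes, lut, scales):
    """Matmul with 4-bit LUT-quantized weights (illustrative sketch,
    not tinygemm's actual API)."""
    # Unpack two 4-bit codes from each byte: low nibble, then high nibble.
    lo = packed_codes & 0x0F
    hi = packed_codes >> 4
    codes = np.stack([lo, hi], axis=-1).reshape(packed_codes.shape[0], -1)
    # Dequantize by table lookup, then apply per-row scales.
    w = lut[codes] * scales[:, None]
    return x @ w.T

rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=(8, 64), dtype=np.uint8)      # 4-bit indices
packed = (codes[:, 0::2] | (codes[:, 1::2] << 4)).astype(np.uint8)
lut = np.linspace(-1.0, 1.0, 16).astype(np.float32)            # 16-entry table
scales = np.ones(8, dtype=np.float32)
x = rng.standard_normal((2, 64)).astype(np.float32)
y = lut_matmul(x, packed, lut, scales)                         # shape (2, 8)
```

On a GPU, a 16-entry table fits easily in registers or shared memory, which is what makes arbitrary (non-grid) 4-bit encodings practical at inference time.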