SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address accuracy collapse and inference instability in ultra-low-bit (1–2 bit) quantization of large language models (LLMs), this paper proposes the Sigma-Delta Quantization framework. It integrates Hadamard transform preprocessing to enhance weight smoothness; introduces a continuously tunable oversampling ratio (OSR) strategy coupled with a MultiOSR hierarchical allocation mechanism, enabling fine-grained, variance-driven dynamic OSR configuration; and replaces floating-point matrix multiplication in linear layers with upsampling followed by Sigma-Delta modulation to achieve binary or ternary weight representation. Evaluated on OPT and LLaMA series models, the method maintains near-full-precision inference accuracy even under low OSR (≤4). It significantly improves computational efficiency, inference stability, and deployment flexibility for ultra-low-bit LLMs.
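The core operation described above — upsampling each weight by the OSR and passing it through a first-order Sigma-Delta modulator so that the *average* of the 1-bit codes tracks the original value — can be sketched as follows. This is a minimal conceptual illustration, not the paper's implementation; function names and the reconstruction-by-averaging step are assumptions for exposition.

```python
import numpy as np

def sigma_delta_binarize(w, osr=4):
    """First-order Sigma-Delta modulation of a weight vector into +/-1 codes.

    Sketch: each weight (assumed in [-1, 1]) is upsampled (repeated) `osr`
    times, then a running error accumulator drives a 1-bit quantizer, so
    the mean of each group of codes approximates the original weight.
    """
    w = np.asarray(w, dtype=np.float64)
    up = np.repeat(w, osr)                 # upsampling by the oversampling ratio
    codes = np.empty_like(up)
    acc = 0.0                              # integrator state (accumulated error)
    for i, x in enumerate(up):
        acc += x
        codes[i] = 1.0 if acc >= 0 else -1.0
        acc -= codes[i]                    # feed the quantized output back
    return codes

def reconstruct(codes, osr=4):
    """Decode by averaging each group of `osr` one-bit codes."""
    return codes.reshape(-1, osr).mean(axis=1)

w = np.array([0.3, -0.7, 0.1, 0.9])
codes = sigma_delta_binarize(w, osr=8)
print(reconstruct(codes, osr=8))          # approximates w to within 2/osr
```

Because the codes are ±1, a matrix-vector product against them reduces to signed additions, which is where the claimed inference-efficiency gain comes from.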

📝 Abstract
Large language models (LLMs) face significant computational and memory challenges, making extremely low-bit quantization crucial for their efficient deployment. In this work, we introduce SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size, a novel framework that enables extremely low-bit quantization of LLMs while preserving their linguistic reasoning capabilities. A distinctive feature of SDQ-LLM is the continuous adjustability of the Over-Sampling Ratio (OSR), enabling dynamic adaptation to memory or VRAM constraints by selecting a fractional OSR (e.g., 2.5×) for an optimal trade-off between model size and accuracy. SDQ-LLM uses upsampling combined with a Sigma-Delta quantizer to binarize or ternarize LLM weights, encoding high-precision parameters into 1-bit or 1.58-bit representations and replacing the multiplication operations within linear layers with addition. This approach significantly enhances inference efficiency under extremely low-bit quantization. To further reduce the loss of quantization precision, we incorporate Hadamard-based weight smoothing prior to quantization, improving the stability and robustness of the weight representations. Furthermore, to fully leverage the continuity of the OSR and reduce precision loss, and recognizing the correlation between quantization sensitivity and weight variance, we propose a fine-grained, layer- and linear-wise OSR allocation strategy, MultiOSR. This strategy distributes OSR both across layers and within each layer, based on weight variance and parameter scale. Finally, extensive experiments on the OPT and LLaMA model families demonstrate that SDQ-LLM achieves more efficient and higher-precision performance even under highly aggressive low-OSR settings. Our code is available at https://github.com/Dreamlittlecat/LLM-Quant-Factory.
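The Hadamard-based smoothing mentioned in the abstract can be illustrated with a small sketch: rotating the weight matrix by an orthogonal (normalized) Hadamard matrix spreads outlier weights across all coordinates, flattening the distribution before quantization, and the rotation can be undone exactly afterwards. The construction below (Sylvester's recursion) is a standard one; how SDQ-LLM applies and folds back the transform is not specified here, so treat this as an assumed minimal version.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_smooth(W):
    """Rotate weight rows with a normalized Hadamard matrix.

    The rotation is orthogonal (H/sqrt(n) squares to the identity), so it
    preserves norms while spreading any single large weight across all
    output coordinates, which smooths the distribution before quantization.
    """
    n = W.shape[1]
    return W @ (hadamard(n) / np.sqrt(n))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
W[0, 0] = 25.0                             # plant an outlier weight
Ws = hadamard_smooth(W)
print(np.abs(W).max(), np.abs(Ws).max())   # the outlier's magnitude shrinks
```

Since the normalized Hadamard matrix is symmetric and orthogonal, applying it a second time recovers the original weights exactly, so the transform costs no accuracy by itself.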
Problem

Research questions and friction points this paper is trying to address.

Enables 1-bit quantization of large language models while preserving reasoning capabilities
Addresses computational and memory challenges in LLM deployment through extreme quantization
Dynamically adapts quantization to memory constraints via adjustable over-sampling ratios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Sigma-Delta Quantizer for 1-bit LLM binarization
Incorporates Hadamard-based smoothing for quantization stability
Proposes fine-grained OSR allocation strategy for precision
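The variance-driven allocation idea behind MultiOSR can be sketched as a budgeted assignment: layers with higher weight variance (higher quantization sensitivity) receive a larger OSR, while the parameter-count-weighted mean OSR is held at a target so total model size matches a uniform-OSR baseline. The proportional rule below is a toy assumption for illustration; the paper's exact allocation formula may differ.

```python
import numpy as np

def allocate_osr(variances, sizes, avg_osr=4.0):
    """Toy variance-driven OSR allocation across layers (assumed rule).

    Each layer's OSR is proportional to its weight variance, then the
    whole vector is rescaled so that sum(osr * size) equals
    avg_osr * sum(size), i.e. the total bit budget of a uniform OSR.
    Fractional OSRs are allowed, exploiting the continuity of the OSR.
    """
    v = np.asarray(variances, dtype=np.float64)
    s = np.asarray(sizes, dtype=np.float64)
    raw = v / v.mean()                      # more variance -> more oversampling
    scale = avg_osr * s.sum() / (raw * s).sum()
    return raw * scale

# Three hypothetical linear layers: (variance, parameter count)
osr = allocate_osr(variances=[0.02, 0.08, 0.04], sizes=[1e6, 4e6, 2e6])
print(osr)                                  # highest-variance layer gets most OSR
```

With these example numbers, the high-variance middle layer receives the largest OSR while the size-weighted average stays at 4, so the quantized model is no larger than one quantized uniformly at OSR 4.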
👥 Authors
Junhao Xia — Dept of Electronic Engineering, Tsinghua University, Beijing, China
Ming Zhao — Beijing National Research Center for Information Science and Technology, Beijing, China
Limin Xiao — FDU (Fiber Optics, Optoelectronics)
Xiujun Zhang — Beijing National Research Center for Information Science and Technology, Beijing, China