FBQuant: FeedBack Quantization for Large Language Models

📅 2025-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on edge devices faces challenges with ultra-low-bit quantization (e.g., 3-bit), which typically incurs substantial accuracy degradation and poor hardware compatibility. To address this, we propose the first weight quantization framework grounded in automatic control theory—specifically, negative feedback principles—that dynamically constrains the reconstructed weight range to enhance quantization robustness. We further design a hardware-coordinated sub-branch error compensation architecture and develop customized CUDA kernels for efficient inference. Evaluated on Llama2-7B, our 3-bit quantization achieves a 1.2% improvement in zero-shot accuracy and reduces the extra inference latency introduced by the sub-branches by 60%, significantly outperforming state-of-the-art methods. The core contribution lies in introducing closed-loop negative feedback into quantization design, jointly optimizing accuracy, stability, and on-device deployment efficiency.
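The summary describes a closed-loop idea: quantize the weights, measure the reconstruction error, and feed part of that error back before re-quantizing, so the reconstructed weights stay bounded by the quantizer's own range. The paper's exact scheme is not given here, so the following is only a minimal sketch of the negative-feedback intuition; the uniform symmetric quantizer, the `gain` parameter, and the fixed step count are all assumptions, not FBQuant's actual design.

```python
import numpy as np

def quantize(w, n_bits=3):
    # Illustrative uniform symmetric quantizer (not the paper's exact scheme).
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def feedback_quantize(w, n_bits=3, gain=0.5, steps=10):
    """Hypothetical closed-loop sketch: repeatedly quantize, then feed a
    fraction of the reconstruction error back into the pre-quantization
    weights (negative feedback), so the final reconstructed weights are
    produced by, and bounded within, the quantizer's grid."""
    w_adj = w.copy()
    for _ in range(steps):
        w_hat = quantize(w_adj, n_bits)
        err = w - w_hat               # reconstruction error w.r.t. the originals
        w_adj = w_adj + gain * err    # feedback correction before re-quantizing
    return quantize(w_adj, n_bits)
```

Because the output is always the quantizer's reconstruction, it can take at most 2^n_bits distinct values regardless of how many feedback steps run; that built-in boundedness is the property the summary attributes to the feedback loop.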

📝 Abstract
Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to the limited computational resources of edge devices. In particular, the key bottleneck stems from memory bandwidth constraints related to weight loading. Weight-only quantization effectively reduces memory access, yet often induces significant accuracy degradation. Recent efforts to incorporate sub-branches have shown promise for mitigating quantization errors, but these methods either lack robust optimization strategies or rely on suboptimal objectives. To address these gaps, we propose FeedBack Quantization (FBQuant), a novel approach inspired by negative feedback mechanisms in automatic control. FBQuant inherently ensures that the reconstructed weights remain bounded by the quantization process, thereby reducing the risk of overfitting. To further offset the additional latency introduced by sub-branches, we develop an efficient CUDA kernel that reduces the extra inference time by 60%. Comprehensive experiments demonstrate the efficiency and effectiveness of FBQuant across various LLMs. Notably, for 3-bit Llama2-7B, FBQuant improves zero-shot accuracy by 1.2%.
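The abstract also mentions sub-branches that compensate for quantization error. The paper's architecture is not detailed here, so the snippet below only illustrates the general sub-branch idea under an assumed design: a low-rank correction (obtained by truncated SVD of the residual error) is a common way to realize such a branch, but the function name, the low-rank choice, and the `rank` parameter are all assumptions, not FBQuant's actual compensation scheme.

```python
import numpy as np

def subbranch_compensate(w, n_bits=3, rank=4):
    """Hypothetical sketch of sub-branch error compensation: quantize the
    main weights, then fit a small low-rank sub-branch to the residual
    quantization error so the forward pass can add it back."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    err = w - w_q                      # residual quantization error
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    a = u[:, :rank] * s[:rank]         # sub-branch factors: err ~= a @ b
    b = vt[:rank]
    return w_q, a, b
```

At inference the effective weight would be `w_q + a @ b`, i.e., a cheap low-bit main branch plus a small full-precision correction; the extra latency of that second branch is what the paper's custom CUDA kernel is said to reduce by 60%.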
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Computational Efficiency
Memory Constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feedback Quantization
Efficient CUDA Kernels
Large Language Model Optimization
👥 Authors
Yijiang Liu (PhD, Machine Learning Efficiency)
Hengyu Fang (Nanjing University)
Liulu He (Nanjing University)
Rongyu Zhang (Nanjing University)
Yichuan Bai (Nanjing University)
Yuan Du (Nanjing University)
Li Du (Nanjing University)