Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Quantization and pruning—two key model compression techniques for large language models (LLMs)—exhibit conflicting distributional preferences: quantization favors compact weight distributions, whereas pruning benefits from high-variance ones. Method: This paper proposes Optimal Brain Restoration (OBR), a unified, training-free framework that jointly optimizes quantization and sparsification. Leveraging Hessian-based second-order information, OBR formulates a principled objective and introduces proxy approximations along with grouped second-order gradient error compensation, enabling derivation of a closed-form solution for synergistic alignment. Contribution/Results: Under W4A4KV4 quantization combined with 50% structured sparsity, OBR achieves up to 4.72× inference speedup and 6.4× memory reduction over FP16 dense models, significantly improving compression ratio and deployment efficiency without retraining.

📝 Abstract
Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their respective limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To address this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization through error compensation between the two. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is reformulated into a tractable problem through surrogate approximation and ultimately admits a closed-form solution via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.
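The closed-form compensation OBR derives belongs to the Optimal Brain Surgeon family of updates. The paper's exact grouped formulation is not reproduced here; the following is a minimal single-weight sketch of the classic OBS update it builds on, where pruning one weight's error is redistributed to the remaining weights via the inverse Hessian (`obs_compensate` and the toy values are illustrative, not from the paper):

```python
import numpy as np

def obs_compensate(w, H_inv, idx):
    """OBS-style update: zero out the weight at `idx` and redistribute
    its removal error to the remaining weights using the inverse Hessian.
    OBR generalizes this idea to grouped compensation that jointly
    accounts for quantization and sparsification error."""
    w = w.copy()
    # Optimal perturbation minimizing 0.5 * dw^T H dw subject to w[idx] -> 0
    delta = -(w[idx] / H_inv[idx, idx]) * H_inv[:, idx]
    w += delta       # compensate all weights
    w[idx] = 0.0     # the pruned weight is exactly zeroed
    return w

# Toy example: 3 weights with a small positive-definite Hessian
H = np.array([[2.0, 0.5, 0.0],
              [0.5, 2.0, 0.5],
              [0.0, 0.5, 2.0]])
w = np.array([0.8, -1.2, 0.4])
w_new = obs_compensate(w, np.linalg.inv(H), idx=1)
```

The key property is that the compensated update incurs no more second-order loss than naively zeroing the weight, which is what makes error compensation attractive when quantization and pruning are applied together.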
Problem

Research questions and friction points this paper is trying to address.

Combining quantization and sparsity for LLM compression
Resolving conflicting weight distribution requirements
Minimizing performance degradation through error compensation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint quantization and sparsity framework
Training-free error compensation method
Second-order Hessian objective optimization
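The second-order objective underlying this line of work (Optimal Brain Surgeon; OBR extends it with surrogate approximations and grouped compensation, which are not reproduced here) can be stated as minimizing the curvature-weighted perturbation subject to eliminating a target weight $w_q$:

$$
\min_{\delta w}\ \tfrac{1}{2}\,\delta w^{\top} H\,\delta w
\quad \text{s.t.} \quad e_q^{\top}\delta w + w_q = 0,
$$

whose closed-form solution is $\delta w = -\dfrac{w_q}{[H^{-1}]_{qq}}\,H^{-1} e_q$, with $H$ the layer Hessian and $e_q$ the $q$-th unit vector.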