Quantitative Analysis of Performance Drop in DeepSeek Model Quantization

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
DeepSeek-R1/V3 (671B, FP8) exceeds GPU memory capacity on standard 8-GPU servers, and existing quantization schemes (e.g., Q3_K_M) suffer from accuracy degradation. Method: We systematically evaluate multi-bit quantization impacts on model performance and propose DQ3_K_M—the first dynamic 3-bit quantization method—integrating multi-granularity weight quantization, dynamic group-wise scaling, FP8 baseline alignment, and dual-backend (CUDA/HIP) support with optimized 3-bit sparse tensor kernels for both NVIDIA and Ascend accelerators. Contribution/Results: 4-bit quantization incurs <0.5% average accuracy loss; DQ3_K_M achieves a 9.2% average improvement over Q3_K_M on MMLU and CMMLU benchmarks, matching 4-bit performance on critical tasks. It enables the first efficient single-machine deployment of the full-parameter R1/V3 model.
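The "dynamic group-wise scaling" component mentioned above can be illustrated with a minimal sketch: weights are split into small groups, and each group gets its own scale so that outliers in one group do not degrade the precision of others. This is a generic symmetric absmax group-wise 3-bit quantizer for illustration only; the group size, rounding scheme, and scaling rule are assumptions, not the paper's actual DQ3_K_M kernel.

```python
import numpy as np

def quantize_groupwise_3bit(weights, group_size=32):
    """Quantize a flat weight vector to 3-bit integers with a per-group
    scale (symmetric absmax). Illustrative only; group_size and the
    scaling rule are assumptions, not DQ3_K_M's actual scheme."""
    w = weights.reshape(-1, group_size)
    # Scale each group so its max magnitude maps to 3 (3-bit range is [-4, 3])
    scales = np.abs(w).max(axis=1, keepdims=True) / 3.0
    scales[scales == 0] = 1.0  # guard against all-zero groups
    q = np.clip(np.round(w / scales), -4, 3).astype(np.int8)
    return q, scales

def dequantize(q, scales, shape):
    """Reconstruct approximate weights from codes and per-group scales."""
    return (q * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(128).astype(np.float32)
q, s = quantize_groupwise_3bit(w)
w_hat = dequantize(q, s, w.shape)
# Rounding error is bounded by half a quantization step per group
err = np.abs(w - w_hat).max()
```

Because the scale is recomputed per group rather than per tensor, a single large weight only coarsens its own group's grid, which is the intuition behind finer-granularity scaling in low-bit schemes.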

📝 Abstract
Recently, there has been high demand for deploying DeepSeek-R1 and V3 locally, partly because the official service is often busy and some organizations have data-privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used technique for reducing model memory consumption. However, it is unclear how DeepSeek-R1 and V3 perform after quantization. This technical report presents the first quantitative evaluation of multi-bitwidth quantization across the complete DeepSeek model spectrum. Key findings reveal that 4-bit quantization incurs little performance degradation versus FP8 while enabling single-machine deployment on standard NVIDIA GPU devices. We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms the traditional Q3_K_M variant on various benchmarks and is comparable with the 4-bit (Q4_K_M) approach on most tasks. Moreover, DQ3_K_M supports single-machine deployment configurations for both NVIDIA H100/A100 and Huawei 910B. Our implementation of DQ3_K_M is released at https://github.com/UnicomAI/DeepSeek-Eval, containing optimized 3-bit quantized variants of both DeepSeek-R1 and DeepSeek-V3.
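The deployment claim follows from back-of-envelope arithmetic: at 8 bits per weight, the 671B parameters alone occupy about 671 GB, already above the 640 GB aggregate of an 8-GPU server with 80 GB per card (the GPU count and per-card capacity here are assumptions matching common H100/A100 configurations; KV cache and activation memory are ignored).

```python
# Weight memory for 671B parameters at different bit-widths versus an
# assumed 8 x 80 GB server (KV cache and activations not counted).
PARAMS = 671e9
SERVER_GB = 8 * 80  # 640 GB aggregate GPU memory (assumption)

for bits in (8, 4, 3):
    weight_gb = PARAMS * bits / 8 / 1e9
    verdict = "fits" if weight_gb < SERVER_GB else "exceeds capacity"
    print(f"{bits}-bit weights: {weight_gb:.0f} GB -> {verdict}")
```

This shows why FP8 weights alone overflow a single machine while 4-bit (~336 GB) and 3-bit (~252 GB) variants leave headroom for the KV cache and runtime overhead.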
Problem

Research questions and friction points this paper is trying to address.

Evaluates performance drop in DeepSeek models after quantization
Proposes DQ3_K_M for efficient 3-bit quantization on GPUs
Enables single-machine deployment of large models via quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

4-bit quantization maintains near-FP8 performance
Dynamic 3-bit method DQ3_K_M outperforms Q3_K_M
DQ3_K_M supports H100/A100 and Huawei 910B deployment
Enbo Zhao
Unicom Data Intelligence, China Unicom; Data Science & Artificial Intelligence Research Institute, China Unicom
Yi Shen
Unicom Data Intelligence, China Unicom; Data Science & Artificial Intelligence Research Institute, China Unicom
Shuming Shi
Tencent AI Lab
NLP, text understanding, knowledge mining, text generation, web search
Jieyun Huang
Unicom Data Intelligence, China Unicom; Data Science & Artificial Intelligence Research Institute, China Unicom
Zhihao Chen
Unicom Data Intelligence, China Unicom; Data Science & Artificial Intelligence Research Institute, China Unicom
Ning Wang
Unicom Data Intelligence, China Unicom; Data Science & Artificial Intelligence Research Institute, China Unicom
Siqi Xiao
Unicom Data Intelligence, China Unicom; Data Science & Artificial Intelligence Research Institute, China Unicom
Jian Zhang
Unicom Data Intelligence, China Unicom; Data Science & Artificial Intelligence Research Institute, China Unicom
Kai Wang
Unicom Data Intelligence, China Unicom; Data Science & Artificial Intelligence Research Institute, China Unicom
Shiguo Lian
CloudMinds