EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

📅 2026-04-10
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Deploying large language models on resource-constrained devices with ultra-low-bit quantization (<4-bit) often incurs severe performance degradation or prohibitive retraining costs. To address this, the work proposes EdgeRazor, a framework that integrates mixed-precision structured quantization, layer-adaptive feature distillation, and entropy-aware KL divergence optimization. EdgeRazor achieves state-of-the-art performance at 1.88 bits, outperforming existing 2-bit and 3-bit methods by 11.27 and 4.38 points, respectively. The quantized Qwen3-0.6B model operates at 1.58 bits, occupies 0.19 GB of storage, delivers a 15.16× inference speedup, and reduces training costs by 4–10×, balancing efficiency, accuracy, and deployment feasibility.
๐Ÿ“ Abstract
Recent years have witnessed an increasing interest in deploying LLMs on resource-constrained devices, among which quantization has emerged as a promising lightweight technique that converts full-precision model weights and activations into lower-bit formats. Existing weight quantization approaches can be roughly divided into three categories: Post-Training Quantization (PTQ) that calibrates quantized parameters on a small dataset without retraining but suffers from severe performance degradation below 4-bit, Quantization-Aware Training (QAT) that searches low-bit parameters using surrogate gradients but demands substantial computational resources, and Quantization-Aware Distillation that integrates QAT with knowledge transfer from a full-precision teacher but manually selects features to distill and relies heavily on teacher-specific data. In this paper, we propose EdgeRazor, a lightweight framework for LLMs with mixed-precision and extremely low-bit weight quantization. The EdgeRazor framework contains three modules: Mixed-Precision Quantization-Aware Distillation for the fine-grained control of precision, Adaptive Feature Distillation that derives an $n$-bit student from its 16-bit teacher, and Entropy-Aware KL Divergence on both human-annotated and distilled datasets, whose forward-reverse balance is determined solely by the teacher's output distribution. Empirical investigations of EdgeRazor are conducted on base, instruction-tuned, and multimodal LLMs. Notably, EdgeRazor with 1.88-bit surpasses all contenders with the 3-bit precision, especially outperforms the leading 2-bit PTQ methods by 11.3 points, within a 4-10$\times$ lower training budget than the leading QAT approach. EdgeRazor delivers higher compression ratios at all bit width; the 1.58-bit Qwen3-0.6B reduces storage from 1.41 GB to 0.28 GB while accelerating decoding by 15.1$\times$ relative to the 16-bit baseline.
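The abstract states only that the forward-reverse balance of the Entropy-Aware KL Divergence depends on the teacher's output distribution; it does not give the formula. As a rough illustration of what such a loss could look like, the sketch below mixes forward KL (teacher to student) and reverse KL (student to teacher) with a weight equal to the teacher's normalized entropy. The weighting rule, the direction of the mix, and every function name here are assumptions made for illustration, not EdgeRazor's actual definition.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def normalized_entropy(p, eps=1e-12):
    """Shannon entropy of p divided by log(len(p)), clamped to [0, 1]."""
    h = -sum(pi * math.log(pi + eps) for pi in p)
    return max(0.0, min(1.0, h / math.log(len(p))))


def entropy_weighted_kl(teacher_logits, student_logits):
    """Hypothetical mix of forward and reverse KL.

    ASSUMPTION: the weight w is the teacher's normalized entropy, so a
    flat teacher distribution favors forward KL and a peaked one favors
    reverse KL. The paper's real rule may differ.
    """
    p = softmax(teacher_logits)   # teacher distribution
    q = softmax(student_logits)   # student distribution
    w = normalized_entropy(p)     # depends on the teacher only
    return w * kl(p, q) + (1.0 - w) * kl(q, p)


if __name__ == "__main__":
    loss = entropy_weighted_kl([2.0, 0.0, -1.0], [0.0, 1.0, 0.5])
    print(f"mixed KL loss: {loss:.4f}")
```

Because the weight is computed from the teacher alone, it can be precomputed per token and treated as a constant during the student's backward pass, which matches the abstract's "determined solely by the teacher" wording under this assumed rule.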
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Low-bit Quantization
Performance Degradation
Resource-constrained Devices
Quantization-aware Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-Precision Quantization
Quantization-Aware Distillation
Layer-Adaptive Feature Distillation
Entropy-Aware KL Divergence
Lightweight LLM Framework
Shu-Hao Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210063, China; School of Intelligent Science and Technology, Nanjing University, Suzhou 215163, China
Le-Tong Huang
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210063, China; School of Intelligent Science and Technology, Nanjing University, Suzhou 215163, China
Xiang-Sheng Deng
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210063, China; School of Intelligent Science and Technology, Nanjing University, Suzhou 215163, China
Xin-Yi Zou
Microsoft AI, Beijing 100080, China
Chen Wu
Microsoft, Tencent, Alibaba, Baidu
Information Retrieval, Natural Language Processing
Nan Li
Microsoft AI, Beijing 100080, China
Shao-Qun Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210063, China; School of Intelligent Science and Technology, Nanjing University, Suzhou 215163, China
Zhi-Hua Zhou
Nanjing University
Artificial Intelligence, Machine Learning, Data Mining