EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Deploying large language models on resource-constrained devices with ultra-low-bit quantization (below 4 bits) typically incurs either severe performance degradation or prohibitive retraining costs. To address this, the work proposes EdgeRazor, a framework that combines mixed-precision structured quantization, layer-adaptive feature distillation, and entropy-aware KL divergence optimization. At 1.88 bits, EdgeRazor outperforms existing 2-bit and 3-bit methods by 11.27 and 4.38 points, respectively. The quantized Qwen3-0.6B model operates at 1.58 bits, reduces storage from 1.41 GB to 0.28 GB, delivers a 15.16× inference speedup over the 16-bit baseline, and requires a 4–10× lower training budget than the leading QAT approach.
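
As a rough illustration of what these bit widths mean, the sketch below assumes that "1.58 bits" refers to ternary weights in {-1, 0, +1} (log2 3 ≈ 1.585) and that a fractional width such as 1.88 bits is a parameter-weighted average over weight groups stored at different precisions. Both readings are assumptions made for illustration and are not confirmed by the summary; the function names `ternary_quantize` and `average_bits` and the example split are hypothetical.

```python
import math

def ternary_quantize(weights):
    """Absmean ternary quantization: each weight becomes scale * {-1, 0, +1}.

    Assumption: a generic scheme for illustration, not EdgeRazor's quantizer.
    """
    # Per-tensor scale is the mean absolute weight (fall back to 1.0 if all zero).
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    # Round to the nearest integer, then clamp into the ternary set.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def average_bits(groups):
    """Parameter-weighted mean bit width over (num_params, bits) groups."""
    total = sum(n for n, _ in groups)
    return sum(n * b for n, b in groups) / total

# A ternary weight carries log2(3) bits of information, about 1.585.
TERNARY_BITS = math.log2(3)

codes, scale = ternary_quantize([0.9, -0.05, -1.2, 0.4])
print(codes)  # [1, 0, -1, 1]

# Hypothetical split: 3 parts of the weights at 2 bits, 1 part at 4 bits.
print(average_bits([(3, 2.0), (1, 4.0)]))  # 2.5
```

Under this reading, a mixed-precision model can land on a non-integer average width simply by assigning different precisions to different groups of weights.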

📝 Abstract

Recent years have witnessed increasing interest in deploying LLMs on resource-constrained devices, for which quantization has emerged as a promising lightweight technique that converts full-precision model weights and activations into lower-bit formats. Existing weight quantization approaches can be roughly divided into three categories: Post-Training Quantization (PTQ), which calibrates quantized parameters on a small dataset without retraining but suffers from severe performance degradation below 4 bits; Quantization-Aware Training (QAT), which searches for low-bit parameters using surrogate gradients but demands substantial computational resources; and Quantization-Aware Distillation, which integrates QAT with knowledge transfer from a full-precision teacher but manually selects the features to distill and relies heavily on teacher-specific data. In this paper, we propose EdgeRazor, a lightweight framework for LLMs with mixed-precision and extremely low-bit weight quantization. The EdgeRazor framework contains three modules: Mixed-Precision Quantization-Aware Distillation for fine-grained control of precision, Adaptive Feature Distillation that derives an $n$-bit student from its 16-bit teacher, and Entropy-Aware KL Divergence on both human-annotated and distilled datasets, whose forward-reverse balance is determined solely by the teacher's output distribution. Empirical investigations of EdgeRazor are conducted on base, instruction-tuned, and multimodal LLMs. Notably, EdgeRazor at 1.88 bits surpasses all contenders at 3-bit precision and outperforms the leading 2-bit PTQ methods by 11.3 points, with a 4-10$\times$ lower training budget than the leading QAT approach. EdgeRazor delivers higher compression ratios at all bit widths; the 1.58-bit Qwen3-0.6B reduces storage from 1.41 GB to 0.28 GB while accelerating decoding by 15.1$\times$ relative to the 16-bit baseline.
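
The abstract states only that the forward-reverse balance of the KL term is determined by the teacher's output distribution. One way such an objective could look is sketched below, assuming the mixing weight is the teacher's normalized entropy, so that a diffuse teacher favors forward KL and a peaked teacher favors reverse KL. The weighting rule, its direction, and every function name here are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def normalized_entropy(p):
    """Shannon entropy of p divided by its maximum log(len(p)), so in [0, 1]."""
    h = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return h / math.log(len(p))

def entropy_weighted_kl(teacher_logits, student_logits):
    """Blend forward and reverse KL with a weight set by the teacher alone.

    Assumption: lam = normalized teacher entropy (hypothetical rule).
    """
    p = softmax(teacher_logits)  # teacher distribution
    q = softmax(student_logits)  # student distribution
    lam = normalized_entropy(p)  # depends only on the teacher
    return lam * kl(p, q) + (1.0 - lam) * kl(q, p)

# Example call on made-up logits over a 3-token vocabulary.
loss = entropy_weighted_kl([2.0, 1.0, 0.1], [1.5, 1.2, 0.3])
print(loss)
```

Because the weight is computed from the teacher's distribution only, no extra hyperparameter has to be tuned per sample in this sketch; whether the paper uses the same rule is not stated in the abstract.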

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Low-bit Quantization

Performance Degradation

Resource-constrained Devices

Quantization-aware Training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-Precision Quantization

Quantization-Aware Distillation

Layer-Adaptive Feature Distillation