Precision Where It Matters: A Novel Spike-Aware Mixed-Precision Quantization Strategy for LLaMA-Based Language Models

📅 2025-04-30
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the substantial performance degradation that quantization causes when deploying large language models (e.g., LLaMA), this paper proposes an activation-spike-aware mixed-precision quantization method. The authors first show that activation spikes in LLaMA architectures are highly concentrated in specific projection layers, a previously unobserved phenomenon, and accordingly design an architecture-customized quantization strategy: spike-prone projection layers are kept in high-precision formats (FP16/FP8), while the remaining modules are quantized to low bit-widths (INT4/INT8). The approach combines layer-wise spike localization with per-tensor calibration. Extensive experiments on LLaMA2, LLaMA3, and Mistral show that the method significantly reduces perplexity and improves zero-shot accuracy; notably, under 8-bit per-tensor quantization it outperforms existing state-of-the-art general-purpose quantization methods across multiple benchmarks.
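The summary's core recipe (locate spike-prone layers from calibration statistics, then assign precision per layer) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the layer names, the calibration values, and the median-based spike threshold are all assumptions made for the example.

```python
# Hypothetical sketch of spike-aware precision assignment: layers whose
# peak activation magnitude far exceeds the typical layer are kept in
# high precision; the rest are marked for low-bit quantization.
from statistics import median

def assign_precision(act_peaks: dict, spike_factor: float = 10.0) -> dict:
    """Map layer name -> 'fp16' (spike-prone) or 'int8' (quantizable)."""
    typical = median(act_peaks.values())
    return {
        name: "fp16" if peak > spike_factor * typical else "int8"
        for name, peak in act_peaks.items()
    }

# Toy calibration stats: one down_proj spikes, echoing the paper's
# observation that spikes concentrate in specific projection layers.
peaks = {
    "layer0.q_proj": 3.1,
    "layer0.k_proj": 2.8,
    "layer0.down_proj": 410.0,  # activation spike
    "layer1.down_proj": 2.9,
}
plan = assign_precision(peaks)
```

Here `plan` keeps only `layer0.down_proj` in FP16 and marks every other layer INT8; the actual paper additionally calibrates per-tensor quantization parameters for the low-bit layers.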

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks. However, their size presents significant challenges for deployment and inference. This paper investigates the quantization of LLMs, focusing on the LLaMA architecture and its derivatives. We challenge existing assumptions about activation outliers in LLMs and propose a novel mixed-precision quantization approach tailored for LLaMA-like models. Our method leverages the observation that activation spikes in LLaMA architectures are predominantly concentrated in specific projection layers. By applying higher precision (FP16 or FP8) to these layers while quantizing the rest of the model to lower bit-widths, we achieve superior performance compared to existing quantization techniques. Experimental results on LLaMA2, LLaMA3, and Mistral models demonstrate significant improvements in perplexity and zero-shot accuracy, particularly for 8-bit per-tensor quantization. Our approach outperforms general-purpose methods that are designed to handle outliers across all architecture types, highlighting the benefits of architecture-specific quantization strategies. This research contributes to the ongoing efforts to make LLMs more efficient and deployable, potentially enabling their use in resource-constrained environments. By identifying and targeting the small number of projections that concentrate activation spikes, our findings emphasize the importance of considering model-specific characteristics when developing effective quantization pipelines for state-of-the-art language models.
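The abstract's emphasis on 8-bit per-tensor quantization is worth unpacking: with a single scale per tensor, one activation spike stretches the scale and crushes the resolution left for ordinary values, which is exactly why keeping spike-prone layers in FP16/FP8 helps. A minimal sketch of symmetric per-tensor INT8 quantization (the values and helper names are illustrative, not from the paper):

```python
def quantize_per_tensor(x, n_bits=8):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    qmax = 2 ** (n_bits - 1) - 1           # 127 for INT8
    scale = max(abs(v) for v in x) / qmax  # a single outlier inflates this
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in x]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

# One spike (120.0) among small activations forces a coarse scale of
# ~0.94, so the small values round to 0 or 1 and lose nearly all detail.
acts = [0.5, -0.3, 0.8, 120.0]
q, scale = quantize_per_tensor(acts)
deq = dequantize(q, scale)
```

After round-tripping, the small activations come back as roughly 0.94 or 0.0, illustrating the failure mode the paper sidesteps by excluding the few spike-concentrating projections from low-bit quantization.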
Problem

Research questions and friction points this paper is trying to address.

Quantizing LLaMA models efficiently by targeting activation spikes
Improving performance with mixed-precision for specific projection layers
Enhancing deployability of LLMs in resource-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-precision quantization for LLaMA-like models
Higher precision for specific projection layers
Improved perplexity and zero-shot accuracy