🤖 AI Summary
To address the severe accuracy degradation of large language models (LLMs) under ultra-low-bit (≤4-bit) quantization—caused by activation outliers and inter-channel variance—this work pioneers a Fourier-domain perspective for modeling activation distributions. We propose a two-stage frequency-domain quantization framework: first, spectral decomposition and outlier redistribution migrate spike energy from activations to weights; second, a channel-adaptive low-frequency truncation mechanism dynamically preserves dominant signal components while suppressing high-frequency noise. The method incorporates a lightweight runtime truncation module, enabling deployment-friendly inference. Evaluated on LLaMA-3 8B, our approach achieves 4-bit weight and 4-bit activation quantization, with only a 1.5% drop in zero-shot task accuracy, 2× inference speedup, and memory footprint reduced to one-third of the full-precision baseline.
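The first stage — migrating activation spike energy into the weight matrix — can be sketched with a per-channel rescaling that keeps the layer output mathematically unchanged. This is a minimal illustration in the style of smoothing-based methods; the function name, the `alpha` balancing parameter, and the max-based scale rule are assumptions for illustration, not the paper's actual Fourier-domain redistribution.

```python
import numpy as np

def migrate_outliers(X, W, alpha=0.5):
    """Illustrative stage-1 sketch: shift activation outlier energy into W.

    Uses the identity X @ W == (X / s) @ (s[:, None] * W) for any
    positive per-channel scale s, so the layer output is preserved
    while activation outliers are flattened. The scale rule below is
    a hypothetical choice, not the paper's spectral redistribution.
    """
    s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
    s = np.clip(s, 1e-5, None)  # guard against division by zero
    return X / s, W * s[:, None]

# Toy usage: one activation channel carries a large outlier.
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 16))
X[:, 0] *= 50.0  # outlier channel
W = rng.standard_normal((16, 4))
X_s, W_s = migrate_outliers(X, W)
```

After the rescaling, `X_s @ W_s` equals `X @ W`, but the dynamic range of the activations is much smaller, which is what makes the downstream low-bit quantization tractable.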
📝 Abstract
The emergence of accurate open large language models (LLMs) has sparked a push for advanced quantization techniques to enable efficient deployment on end-user devices. In this paper, we revisit the challenge of extreme LLM compression -- targeting ultra-low-bit quantization for both activations and weights -- from a Fourier frequency-domain perspective. We propose SpecQuant, a two-stage framework that tackles activation outliers and cross-channel variance. In the first stage, activation outliers are smoothed and transferred into the weight matrix to simplify downstream quantization. In the second stage, we apply channel-wise low-frequency Fourier truncation to suppress high-frequency components while preserving essential signal energy, improving quantization robustness. Our method builds on the principle that most of the weight energy is concentrated in low-frequency components, which can be retained with minimal impact on model accuracy. To enable runtime adaptability, we introduce a lightweight truncation module during inference that adjusts truncation thresholds based on channel characteristics. On LLaMA-3 8B, SpecQuant achieves 4-bit quantization for both weights and activations, narrowing the zero-shot accuracy gap to only 1.5% compared to full precision, while delivering 2× faster inference and 3× lower memory usage.
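The second stage — channel-wise low-frequency Fourier truncation — can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the fixed `keep_ratio` are hypothetical (the paper's runtime module adapts the threshold per channel), and a plain real FFT along each channel stands in for the actual spectral decomposition.

```python
import numpy as np

def lowfreq_truncate(W, keep_ratio=0.25):
    """Illustrative stage-2 sketch: per-channel low-frequency truncation.

    Each row (channel) of W is transformed with a real FFT, its
    highest-frequency coefficients are zeroed, and the row is
    transformed back. `keep_ratio` is a hypothetical fixed threshold;
    the paper adapts it per channel at inference time.
    """
    spec = np.fft.rfft(W, axis=1)                 # per-channel spectrum
    n_keep = max(1, int(spec.shape[1] * keep_ratio))
    spec[:, n_keep:] = 0.0                        # suppress high frequencies
    return np.fft.irfft(spec, n=W.shape[1], axis=1)

# Toy usage: truncate a small weight matrix and measure retained energy.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 64))
W_trunc = lowfreq_truncate(W, keep_ratio=0.25)
energy_kept = np.linalg.norm(W_trunc) ** 2 / np.linalg.norm(W) ** 2
```

For trained LLM weights, whose energy concentrates in low-frequency components, `energy_kept` stays close to 1 even at small `keep_ratio`; the random matrix here is only a shape-and-mechanics demonstration.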