SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe accuracy degradation of large language models (LLMs) under ultra-low-bit (≤4-bit) quantization—caused by activation outliers and inter-channel variance—this work pioneers a Fourier-domain perspective for modeling activation distributions. We propose a two-stage frequency-domain quantization framework: first, spectral decomposition and outlier redistribution migrate spike energy from activations to weights; second, a channel-adaptive low-frequency truncation mechanism dynamically preserves dominant signal components while suppressing high-frequency noise. The method incorporates a lightweight runtime truncation module, enabling deployment-friendly inference. Evaluated on LLaMA-3 8B, our approach achieves 4-bit weight and 4-bit activation quantization, with only a 1.5% drop in zero-shot task accuracy, 2× inference speedup, and memory footprint reduced to one-third of the full-precision baseline.

📝 Abstract
The emergence of accurate open large language models (LLMs) has sparked a push for advanced quantization techniques to enable efficient deployment on end-user devices. In this paper, we revisit the challenge of extreme LLM compression -- targeting ultra-low-bit quantization for both activations and weights -- from a Fourier frequency domain perspective. We propose SpecQuant, a two-stage framework that tackles activation outliers and cross-channel variance. In the first stage, activation outliers are smoothed and transferred into the weight matrix to simplify downstream quantization. In the second stage, we apply channel-wise low-frequency Fourier truncation to suppress high-frequency components while preserving essential signal energy, improving quantization robustness. Our method builds on the principle that most of the weight energy is concentrated in low-frequency components, which can be retained with minimal impact on model accuracy. To enable runtime adaptability, we introduce a lightweight truncation module during inference that adjusts truncation thresholds based on channel characteristics. On LLaMA-3 8B, SpecQuant achieves 4-bit quantization for both weights and activations, narrowing the zero-shot accuracy gap to only 1.5% compared to full precision, while delivering 2× faster inference and 3× lower memory usage.
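The first stage's outlier migration rests on an equivalence-preserving rescaling: dividing activation channels by a per-channel scale and folding that scale into the corresponding weight rows leaves the layer output unchanged while flattening activation spikes. A minimal NumPy sketch of this idea, assuming a SmoothQuant-style scale with a balancing exponent `alpha` (the function name and the exact scale rule are illustrative assumptions, not the paper's formulation):

```python
import numpy as np

def smooth_outliers(X, W, alpha=0.5):
    """Migrate activation outlier energy into the weights.

    For y = X @ W, rescaling columns of X by 1/s and rows of W by s
    leaves the output unchanged: (X / s) @ (diag(s) @ W) = X @ W.
    Choosing s from per-channel magnitudes flattens activation spikes,
    so X becomes easier to quantize at low bit widths.
    """
    act_max = np.abs(X).max(axis=0)              # per input channel, shape (in_features,)
    w_max = np.abs(W).max(axis=1)                # per weight row, shape (in_features,)
    # alpha balances how much outlier energy moves into the weights.
    s = act_max**alpha / np.clip(w_max, 1e-8, None)**(1 - alpha)
    s = np.clip(s, 1e-8, None)
    X_smooth = X / s                             # outlier channels shrunk
    W_fused = W * s[:, None]                     # scale folded into weights
    return X_smooth, W_fused

# The matrix product is preserved exactly (up to float error):
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50                                    # inject an outlier channel
W = rng.normal(size=(8, 5))
Xs, Wf = smooth_outliers(X, W)
assert np.allclose(X @ W, Xs @ Wf)
```

After smoothing, the outlier channel's dynamic range is much closer to the others, which is what makes 4-bit activation quantization viable downstream.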
Problem

Research questions and friction points this paper is trying to address.

Addresses ultra-low-bit quantization for LLM weights and activations
Reduces activation outliers and cross-channel variance via spectral decomposition
Enables adaptive frequency truncation during inference for accuracy preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Smoothing activation outliers via weight transfer
Applying channel-wise low-frequency Fourier truncation
Using adaptive truncation module for runtime adjustment
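The truncation idea in the bullets above can be sketched as follows: each channel is transformed to the Fourier domain, a cutoff is chosen per channel, and coefficients above it are zeroed. The cumulative-energy rule and the `energy_keep` parameter here are illustrative assumptions; the paper's runtime module adapts thresholds to channel characteristics, whose exact rule is not given on this page.

```python
import numpy as np

def lowfreq_truncate(w_row, energy_keep=0.99):
    """Keep only the lowest-frequency Fourier components of one channel,
    choosing the cutoff so that `energy_keep` of the spectral energy is
    preserved; all higher-frequency coefficients are zeroed."""
    spec = np.fft.rfft(w_row)
    energy = np.abs(spec) ** 2
    cum = np.cumsum(energy) / energy.sum()
    k = int(np.searchsorted(cum, energy_keep)) + 1   # per-channel adaptive cutoff
    spec[k:] = 0.0
    return np.fft.irfft(spec, n=w_row.size)

def channelwise_truncate(W, energy_keep=0.99):
    # Each channel (row) gets its own cutoff: channels whose energy is
    # concentrated at low frequencies are truncated more aggressively.
    return np.stack([lowfreq_truncate(r, energy_keep) for r in W])

# Example: channels dominated by slow oscillations survive truncation intact.
t = np.linspace(0, 1, 64, endpoint=False)
W = np.stack([np.sin(2 * np.pi * t), np.sin(2 * np.pi * 2 * t)])
W_trunc = channelwise_truncate(W, energy_keep=0.99)
```

Because most weight energy sits in low-frequency bins (the principle the abstract states), the truncated channels reconstruct nearly losslessly while the discarded high-frequency coefficients no longer inflate the quantization range.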
Zhixiong Zhao
Shanghai Jiao Tong University, Nanyang Technological University
Fangxin Liu
Shanghai Jiao Tong University
In-memory Computing, Brain-inspired Neuromorphic Computing
Junjie Wang
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Chenyang Guan
Shanghai Jiao Tong University
Zongwu Wang
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Li Jiang
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Haibing Guan
Shanghai Jiao Tong University