SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe accuracy degradation of large language models (LLMs) under ultra-low-bit (≤4-bit) quantization—caused by activation outliers and inter-channel variance—this work pioneers a Fourier-domain perspective for modeling activation distributions. We propose a two-stage frequency-domain quantization framework: first, spectral decomposition and outlier redistribution migrate spike energy from activations to weights; second, a channel-adaptive low-frequency truncation mechanism dynamically preserves dominant signal components while suppressing high-frequency noise. The method incorporates a lightweight runtime truncation module, enabling deployment-friendly inference. Evaluated on LLaMA-3 8B, our approach achieves 4-bit weight and 4-bit activation quantization, with only a 1.5% drop in zero-shot task accuracy, 2× inference speedup, and memory footprint reduced to one-third of the full-precision baseline.

📝 Abstract
The emergence of accurate open large language models (LLMs) has sparked a push for advanced quantization techniques to enable efficient deployment on end-user devices. In this paper, we revisit the challenge of extreme LLM compression -- targeting ultra-low-bit quantization for both activations and weights -- from a Fourier frequency domain perspective. We propose SpecQuant, a two-stage framework that tackles activation outliers and cross-channel variance. In the first stage, activation outliers are smoothed and transferred into the weight matrix to simplify downstream quantization. In the second stage, we apply channel-wise low-frequency Fourier truncation to suppress high-frequency components while preserving essential signal energy, improving quantization robustness. Our method builds on the principle that most of the weight energy is concentrated in low-frequency components, which can be retained with minimal impact on model accuracy. To enable runtime adaptability, we introduce a lightweight truncation module during inference that adjusts truncation thresholds based on channel characteristics. On LLaMA-3 8B, SpecQuant achieves 4-bit quantization for both weights and activations, narrowing the zero-shot accuracy gap to only 1.5% compared to full precision, while delivering 2× faster inference and 3× lower memory usage.
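The first stage's outlier migration rests on an equivalence-preserving rescaling: dividing activation channels by a per-channel scale and folding that scale into the corresponding weight rows leaves the layer output unchanged while flattening activation spikes. A minimal NumPy sketch of this idea, assuming a SmoothQuant-style scale with a balancing exponent `alpha` (the function name and the exact scale rule are illustrative assumptions, not the paper's formulation):

```python
import numpy as np

def smooth_outliers(X, W, alpha=0.5):
    """Migrate activation outlier energy into the weights.

    For y = X @ W, rescaling columns of X by 1/s and rows of W by s
    leaves the output unchanged: (X / s) @ (diag(s) @ W) = X @ W.
    Choosing s from per-channel magnitudes flattens activation spikes,
    so X becomes easier to quantize at low bit widths.
    """
    act_max = np.abs(X).max(axis=0)              # per input channel, shape (in_features,)
    w_max = np.abs(W).max(axis=1)                # per weight row, shape (in_features,)
    # alpha balances how much outlier energy moves into the weights.
    s = act_max**alpha / np.clip(w_max, 1e-8, None)**(1 - alpha)
    s = np.clip(s, 1e-8, None)
    X_smooth = X / s                             # outlier channels shrunk
    W_fused = W * s[:, None]                     # scale folded into weights
    return X_smooth, W_fused

# The matrix product is preserved exactly (up to float error):
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50                                    # inject an outlier channel
W = rng.normal(size=(8, 5))
Xs, Wf = smooth_outliers(X, W)
assert np.allclose(X @ W, Xs @ Wf)
```

After smoothing, the outlier channel's dynamic range is much closer to the others, which is what makes 4-bit activation quantization viable downstream.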
Problem

Research questions and friction points this paper is trying to address.

Addresses ultra-low-bit quantization for LLM weights and activations
Reduces activation outliers and cross-channel variance via spectral decomposition
Enables adaptive frequency truncation during inference for accuracy preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Smoothing activation outliers via weight transfer
Applying channel-wise low-frequency Fourier truncation
Using adaptive truncation module for runtime adjustment
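The truncation idea in the bullets above can be sketched as follows: each channel is transformed to the Fourier domain, a cutoff is chosen per channel, and coefficients above it are zeroed. The cumulative-energy rule and the `energy_keep` parameter here are illustrative assumptions; the paper's runtime module adapts thresholds to channel characteristics, whose exact rule is not given on this page.

```python
import numpy as np

def lowfreq_truncate(w_row, energy_keep=0.99):
    """Keep only the lowest-frequency Fourier components of one channel,
    choosing the cutoff so that `energy_keep` of the spectral energy is
    preserved; all higher-frequency coefficients are zeroed."""
    spec = np.fft.rfft(w_row)
    energy = np.abs(spec) ** 2
    cum = np.cumsum(energy) / energy.sum()
    k = int(np.searchsorted(cum, energy_keep)) + 1   # per-channel adaptive cutoff
    spec[k:] = 0.0
    return np.fft.irfft(spec, n=w_row.size)

def channelwise_truncate(W, energy_keep=0.99):
    # Each channel (row) gets its own cutoff: channels whose energy is
    # concentrated at low frequencies are truncated more aggressively.
    return np.stack([lowfreq_truncate(r, energy_keep) for r in W])

# Example: channels dominated by slow oscillations survive truncation intact.
t = np.linspace(0, 1, 64, endpoint=False)
W = np.stack([np.sin(2 * np.pi * t), np.sin(2 * np.pi * 2 * t)])
W_trunc = channelwise_truncate(W, energy_keep=0.99)
```

Because most weight energy sits in low-frequency bins (the principle the abstract states), the truncated channels reconstruct nearly losslessly while the discarded high-frequency coefficients no longer inflate the quantization range.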
Zhixiong Zhao
Shanghai Jiao Tong University, Nanyang Technological University
Fangxin Liu
Shanghai Jiao Tong University
In-memory Computing, Brain-inspired Neuromorphic Computing
Junjie Wang
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Chenyang Guan
Shanghai Jiao Tong University
Zongwu Wang
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Li Jiang
Shanghai Jiao Tong University, Shanghai Qi Zhi Institute
Haibing Guan
Shanghai Jiao Tong University