Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

While ultra-low-bit quantization of large language models reduces deployment costs, it not only incurs numerical precision loss but also causes a systematic degradation in smoothness, significantly impairing generation quality. This work identifies smoothness as a central consideration in quantization design and introduces an analytical framework based on a smoothness proxy metric and sequence neighborhood modeling, revealing that effective token candidates sharply contract within the prediction neighborhood after quantization. Building on this insight, the authors incorporate smoothness-preserving constraints into both post-training quantization (PTQ) and quantization-aware training (QAT). Experiments demonstrate that the proposed approach effectively mitigates decoding tree sparsification and substantially outperforms existing quantization methods that optimize solely for numerical accuracy across multiple benchmarks.

📝 Abstract

Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low-bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit-width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate this finding, we introduce a simple smoothness-preserving principle in both post-training quantization and quantization-aware training, and demonstrate that preserving smoothness brings additional gains beyond numerical accuracy. The core goal of this paper is to highlight smoothness preservation as an important design consideration for future extreme quantization methods. Code is available at https://github.com/xuyuzhuang11/FINE.
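
The abstract does not define the smoothness proxy or say how effective token candidates are counted, so the following is only an illustrative sketch, not the authors' method. It assumes a toy single-layer "model" (a random weight matrix producing logits), symmetric round-to-nearest quantization, a finite-difference sensitivity as a stand-in for a smoothness proxy, and a probability cutoff of 0.01 as a stand-in for "effective" candidates. All function names and parameters here are hypothetical.

```python
import math
import random


def quantize_rtn(weights, bits):
    """Symmetric uniform round-to-nearest quantization of a flat list of floats."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1  # 1 for 2-bit, 7 for 4-bit, 127 for 8-bit
    scale = max_abs / levels
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(weight_rows, x):
    """One logit per vocabulary row: dot product of the row with the input."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def effective_candidates(probs, threshold=0.01):
    """Count tokens with probability >= threshold (assumed cutoff, not from the paper)."""
    return sum(1 for p in probs if p >= threshold)


def sensitivity(weight_rows, x, delta):
    """Assumed smoothness stand-in: output change / input change in L2 norm.

    `delta` must be a nonzero perturbation of the same length as `x`.
    """
    x_shift = [xi + di for xi, di in zip(x, delta)]
    p0 = softmax(logits_for(weight_rows, x))
    p1 = softmax(logits_for(weight_rows, x_shift))
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(p0, p1)))
    den = math.sqrt(sum(d * d for d in delta))
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)
    vocab, dim = 50, 16  # toy sizes chosen only for illustration
    W = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
    x = [rng.gauss(0, 1) for _ in range(dim)]
    delta = [rng.gauss(0, 0.01) for _ in range(dim)]
    for bits in (8, 4, 2):
        Wq = [quantize_rtn(row, bits) for row in W]
        probs = softmax(logits_for(Wq, x))
        print(bits, effective_candidates(probs), sensitivity(Wq, x, delta))
```

Running the script prints, for each bit-width, the candidate count and the sensitivity value of the toy model. Whether these numbers show the contraction described in the abstract depends on the random weights and the chosen cutoff; a real measurement would need an actual LLM and the definitions given in the paper.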

Problem

Research questions and friction points this paper is trying to address.

smoothness

extreme quantization

LLMs

generation quality

numerical accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

smoothness preservation

extreme quantization

LLM quantization