ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing post-training quantization (PTQ) methods for large language models, such as GPTQ and RTN, lose substantial accuracy at sub-4-bit weight precision. This work proposes ADMM-Q, a layer-wise weight quantization algorithm based on a combinatorial variant of the Alternating Direction Method of Multipliers (ADMM). Its operator-splitting procedure updates the weights in continuous space to minimize the layer-wise reconstruction error while gradually enforcing the quantization constraints. Penalty scheduling, preconditioning, and a local search post-processing step make the method efficient at LLM scale, and it remains composable with range clipping, learned or random rotations, and activation scaling. Replacing GPTQ with ADMM-Q on Qwen3-8B lowers WikiText-2 perplexity from 12.85 to 10.06 in the W3A16 weight-only setting, from 9.29 to 8.68 with W4A8 SmoothQuant, and from 66.11 to 19.42 with W2A4KV4 SpinQuant.
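
For orientation, the layer-wise problem that Hessian-based weight quantizers typically target can be written as below. This is the standard textbook formulation, not an equation taken from the paper. The symbols are assumed notation: $X$ holds calibration inputs (one sample per row), $W$ is the original weight matrix, $\widehat{W}$ the quantized one, $Z$ an auxiliary copy, and $\mathcal{Q}$ the set of matrices whose entries lie on the quantization grid.

$$
\min_{\widehat{W} \in \mathcal{Q}} \; \tfrac{1}{2} \lVert X W - X \widehat{W} \rVert_F^2
\;=\;
\min_{\widehat{W} \in \mathcal{Q}} \; \tfrac{1}{2} \operatorname{tr}\!\left( (\widehat{W} - W)^\top H \, (\widehat{W} - W) \right),
\qquad H = X^\top X .
$$

Here $H$ is the Hessian of the objective with respect to each column of $\widehat{W}$. An ADMM-style splitting would move the discrete constraint onto the copy $Z$ and couple the two variables through an equality constraint:

$$
\min_{\widehat{W},\, Z} \; \tfrac{1}{2} \operatorname{tr}\!\left( (\widehat{W} - W)^\top H \, (\widehat{W} - W) \right)
\quad \text{subject to} \quad \widehat{W} = Z, \;\; Z \in \mathcal{Q} .
$$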

📝 Abstract

Quantization is an effective strategy to reduce the storage and computation footprint of large language models (LLMs). Post-training quantization (PTQ) is a leading approach for compressing LLMs. Popular weight quantization procedures, including GPTQ and RTN, suffer in model utility, especially at aggressive quantization levels (sub-4-bit). We propose ADMM-Q, a novel weight quantization algorithm that considers the layer-wise quantization problem. Our algorithm is based on a combinatorial variant of the Alternating Direction Method of Multipliers (ADMM). Our operator-splitting procedure updates weights continuously to minimize the layer-wise reconstruction error, while gradually enforcing the quantization constraints with convergence guarantees. We propose additional algorithmic enhancements (e.g., penalty scheduling, preconditioning, and a local search post-processing step) to make ADMM-Q efficient at LLM scale. ADMM-Q is modular and can be used as a drop-in replacement for any weight quantizer within existing quantization pipelines: ADMM-Q is fully composable with existing techniques including range clipping, learned or random rotations, and activation scaling. Using ADMM-Q in place of GPTQ on Qwen3-8B, we decrease WikiText-2 perplexity in: (i) the W3A16 weight-only setting (12.85 $\rightarrow$ 10.06); (ii) the W4A8 SmoothQuant procedure (9.29 $\rightarrow$ 8.68); and (iii) the W2A4KV4 SpinQuant procedure (66.11 $\rightarrow$ 19.42).
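
The operator-splitting idea described above can be illustrated with a generic ADMM loop in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`project_to_grid`, `recon_error`, `admm_quantize`), the symmetric per-output-channel grid with a fixed scale, the geometric penalty schedule (`rho_growth`), and the best-iterate tracking are all hypothetical choices made for illustration. The preconditioning and local search steps mentioned in the abstract are omitted.

```python
import numpy as np


def project_to_grid(V, scale, bits):
    """Round V to the nearest point of a symmetric uniform grid (per-column scale)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(V / scale), qmin, qmax) * scale


def recon_error(W, W0, H):
    """Layer-wise reconstruction error tr((W - W0)^T H (W - W0))."""
    D = W - W0
    return float(np.sum(D * (H @ D)))


def admm_quantize(W0, X, bits=3, rho=1.0, rho_growth=1.2, iters=50):
    """Generic ADMM sketch for min 0.5*||X W0 - X W||_F^2 with W on a quantization grid.

    W0: (d_in, d_out) original weights; X: (n_samples, d_in) calibration inputs.
    Returns the best quantized iterate found and the per-column scale.
    """
    d_in = W0.shape[0]
    H = X.T @ X / X.shape[0]  # Hessian of the layer-wise objective (normalized)
    qmax = 2 ** (bits - 1) - 1
    # Assumption: the scale is fixed from the original weights and never re-fitted.
    scale = np.maximum(np.abs(W0).max(axis=0, keepdims=True), 1e-12) / qmax

    Z = project_to_grid(W0, scale, bits)  # round-to-nearest starting point
    Lam = np.zeros_like(W0)               # unscaled dual variable
    best_Z, best_err = Z.copy(), recon_error(Z, W0, H)
    HW0 = H @ W0
    eye = np.eye(d_in)

    for _ in range(iters):
        # Continuous update: minimize the augmented Lagrangian in W (a linear solve).
        W = np.linalg.solve(H + rho * eye, HW0 + rho * Z - Lam)
        # Discrete update: project the shifted weights onto the grid.
        Z = project_to_grid(W + Lam / rho, scale, bits)
        # Dual ascent on the constraint W = Z.
        Lam = Lam + rho * (W - Z)
        # Keep the best feasible iterate seen so far.
        err = recon_error(Z, W0, H)
        if err < best_err:
            best_Z, best_err = Z.copy(), err
        # Penalty schedule: a growing rho enforces W = Z more and more strictly.
        rho *= rho_growth

    return best_Z, scale
```

Calling `admm_quantize(W0, X, bits=3)` on a weight matrix and a batch of calibration inputs returns weights that lie on the 3-bit grid. Because the loop starts from round-to-nearest and keeps the best iterate, the returned reconstruction error in this sketch is never worse than round-to-nearest at the same scale; whether it improves, and by how much, depends on the data and the schedule.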

Problem

Research questions and friction points this paper is trying to address.

post-training quantization

large language models

weight quantization

low-bit quantization

model utility

Innovation

Methods, ideas, or system contributions that make the work stand out.

ADMM-Q

post-training quantization

weight quantization