HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs

πŸ“… 2026-01-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the optimization challenges in ultra-low-bit quantization-aware training (QAT) for large language models, where premature use of hard rounding and straight-through estimators leads to gradient mismatch and poor convergence. To overcome this, the authors propose a Hessian-guided differentiable QAT framework that replaces hard quantization with a temperature-controlled softmax relaxation. Leveraging tensor-level Hessian trace as a lightweight curvature signal, the method dynamically adjusts the temperature to enable sensitivity-aware progressive discretization. Evaluated on Llama-3.2, the approach substantially outperforms existing ternary QAT baselines, achieving zero-shot accuracy improvements of 5.39% and 4.34% on 1B and 3B models, respectively, and effectively recovering the representational capacity of 1.58-bit models.
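The core mechanism is easy to sketch. Below is a minimal PyTorch illustration (not the authors' released code) of a temperature-controlled softmax relaxation over a ternary codebook; the function name soft_ternary and the single per-tensor scale are assumptions made for clarity. As the temperature tau shrinks, the softmax sharpens into a one-hot pick of the nearest level, recovering hard rounding.

```python
import torch

def soft_ternary(w: torch.Tensor, tau: float, scale: float = 1.0) -> torch.Tensor:
    """Differentiable relaxation of ternary quantization to scale * {-1, 0, +1}.

    Each weight is replaced by a softmax-weighted mixture of the ternary
    levels; as tau -> 0 the mixture collapses onto the nearest level, so hard
    quantization is approached progressively instead of being imposed from
    the start of training.
    """
    levels = torch.tensor([-1.0, 0.0, 1.0], device=w.device)
    # Negative squared distance to each level serves as the logit: the
    # closer a weight is to a level, the more probability mass it gets.
    logits = -((w / scale).unsqueeze(-1) - levels) ** 2 / tau   # shape [..., 3]
    probs = torch.softmax(logits, dim=-1)
    return scale * (probs * levels).sum(dim=-1)
```

Because this relaxation is differentiable everywhere, gradients flow through the quantizer itself rather than being copied across it by a straight-through estimator.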

πŸ“ Abstract
As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low-bit quantization. However, most quantization-aware training (QAT) methods apply hard rounding and the straight-through estimator (STE) from the start of training, which prematurely discretizes the optimization landscape and induces a persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian-guided differentiable QAT framework for extremely low-bit LLMs, which replaces the rigid step function with a temperature-controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor-wise Hessian trace metric as a lightweight curvature signal to drive fine-grained temperature annealing, enabling sensitivity-aware discretization across the model. Evaluations on Llama-3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero-shot improvements of 5.39% and 4.34% for the 1B and 3B models, respectively. These results indicate that Hessian-guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58-bit LLMs. The code is available at https://github.com/hestia2026/Hestia.
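The curvature signal described in the abstract can be approximated with Hutchinson's stochastic trace estimator. The sketch below, including the anneal_temperature schedule that cools high-curvature (sensitive) tensors more slowly, is a hedged illustration of the idea under assumed hyperparameters, not the paper's exact formulation.

```python
import torch

def hessian_trace(loss: torch.Tensor, param: torch.Tensor, n_samples: int = 8) -> float:
    """Hutchinson estimator of tr(H) for one parameter tensor: E[v^T H v]
    over random Rademacher vectors v, computed via Hessian-vector products."""
    grad = torch.autograd.grad(loss, param, create_graph=True)[0]
    est = 0.0
    for _ in range(n_samples):
        v = torch.randint_like(param, high=2) * 2 - 1            # entries in {-1, +1}
        hv = torch.autograd.grad(grad, param, grad_outputs=v, retain_graph=True)[0]
        est += (v * hv).sum().item()
    return est / n_samples

def anneal_temperature(tau: float, trace: float, trace_max: float,
                       tau_min: float = 1e-3, decay: float = 0.95) -> float:
    """Illustrative sensitivity-aware schedule (an assumption, not the paper's
    rule): tensors with a larger Hessian trace keep a softer quantizer longer."""
    sensitivity = min(abs(trace) / (trace_max + 1e-12), 1.0)
    rate = decay ** (1.0 - sensitivity)   # sensitivity near 1 -> slower cooling
    return max(tau * rate, tau_min)
```

Per-tensor traces of this kind cost only a few extra backward passes per measurement, which is what makes the curvature signal lightweight enough to query during training.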
Problem

Research questions and friction points this paper is trying to address.

quantization-aware training
extremely low-bit quantization
gradient mismatch
large language models
Hessian
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hessian-guided
differentiable quantization
temperature annealing
low-bit LLMs
quantization-aware training
Authors
Guoan Wang (Stevens Institute of Technology) - General Medical AI
Feiyu Wang (Fudan University) - computer vision
Zongwei Lv (School of Software and Microelectronics, Peking University, Beijing, China)
Yikun Zong (School of Computer Science, Peking University, Beijing, China)
Tong Yang (Peking University, Beijing, China) - Sketch, Network measurement, Bloom filter, IP lookup, Hash Table