HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs

πŸ“… 2026-01-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the optimization challenges in ultra-low-bit quantization-aware training (QAT) for large language models, where premature use of hard rounding and straight-through estimators leads to gradient mismatch and poor convergence. To overcome this, the authors propose a Hessian-guided differentiable QAT framework that replaces hard quantization with a temperature-controlled softmax relaxation. Leveraging tensor-level Hessian trace as a lightweight curvature signal, the method dynamically adjusts the temperature to enable sensitivity-aware progressive discretization. Evaluated on Llama-3.2, the approach substantially outperforms existing ternary QAT baselines, achieving zero-shot accuracy improvements of 5.39% and 4.34% on 1B and 3B models, respectively, and effectively recovering the representational capacity of 1.58-bit models.
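The core mechanism is easy to sketch. Below is a minimal PyTorch illustration (not the authors' released code) of a temperature-controlled softmax relaxation over a ternary codebook; the function name soft_ternary and the single per-tensor scale are assumptions made for clarity. As the temperature tau shrinks, the softmax sharpens into a one-hot pick of the nearest level, recovering hard rounding.

```python
import torch

def soft_ternary(w: torch.Tensor, tau: float, scale: float = 1.0) -> torch.Tensor:
    """Differentiable relaxation of ternary quantization to scale * {-1, 0, +1}.

    Each weight is replaced by a softmax-weighted mixture of the ternary
    levels; as tau -> 0 the mixture collapses onto the nearest level, so hard
    quantization is approached progressively instead of being imposed from
    the start of training.
    """
    levels = torch.tensor([-1.0, 0.0, 1.0], device=w.device)
    # Negative squared distance to each level serves as the logit: the
    # closer a weight is to a level, the more probability mass it gets.
    logits = -((w / scale).unsqueeze(-1) - levels) ** 2 / tau   # shape [..., 3]
    probs = torch.softmax(logits, dim=-1)
    return scale * (probs * levels).sum(dim=-1)
```

Because this relaxation is differentiable everywhere, gradients flow through the quantizer itself rather than being copied across it by a straight-through estimator.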

πŸ“ Abstract
As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low-bit quantization. However, most quantization-aware training (QAT) methods apply hard rounding and the straight-through estimator (STE) from the start of training, which prematurely discretizes the optimization landscape and induces a persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian-guided differentiable QAT framework for extremely low-bit LLMs, which replaces the rigid step function with a temperature-controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor-wise Hessian trace metric as a lightweight curvature signal to drive fine-grained temperature annealing, enabling sensitivity-aware discretization across the model. Evaluations on Llama-3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero-shot improvements of 5.39% and 4.34% for the 1B and 3B models, respectively. These results indicate that Hessian-guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58-bit LLMs. The code is available at https://github.com/hestia2026/Hestia.
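The curvature signal described in the abstract can be approximated with Hutchinson's stochastic trace estimator. The sketch below, including the anneal_temperature schedule that cools high-curvature (sensitive) tensors more slowly, is a hedged illustration of the idea under assumed hyperparameters, not the paper's exact formulation.

```python
import torch

def hessian_trace(loss: torch.Tensor, param: torch.Tensor, n_samples: int = 8) -> float:
    """Hutchinson estimator of tr(H) for one parameter tensor: E[v^T H v]
    over random Rademacher vectors v, computed via Hessian-vector products."""
    grad = torch.autograd.grad(loss, param, create_graph=True)[0]
    est = 0.0
    for _ in range(n_samples):
        v = torch.randint_like(param, high=2) * 2 - 1            # entries in {-1, +1}
        hv = torch.autograd.grad(grad, param, grad_outputs=v, retain_graph=True)[0]
        est += (v * hv).sum().item()
    return est / n_samples

def anneal_temperature(tau: float, trace: float, trace_max: float,
                       tau_min: float = 1e-3, decay: float = 0.95) -> float:
    """Illustrative sensitivity-aware schedule (an assumption, not the paper's
    rule): tensors with a larger Hessian trace keep a softer quantizer longer."""
    sensitivity = min(abs(trace) / (trace_max + 1e-12), 1.0)
    rate = decay ** (1.0 - sensitivity)   # sensitivity near 1 -> slower cooling
    return max(tau * rate, tau_min)
```

Per-tensor traces of this kind cost only a few extra backward passes per measurement, which is what makes the curvature signal lightweight enough to query during training.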
Problem

Research questions and friction points this paper is trying to address.

quantization-aware training
extremely low-bit quantization
gradient mismatch
large language models
Hessian
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hessian-guided
differentiable quantization
temperature annealing
low-bit LLMs
quantization-aware training
Authors
Guoan Wang (Stevens Institute of Technology) - General Medical AI
Feiyu Wang (Fudan University) - computer vision
Zongwei Lv (School of Software and Microelectronics, Peking University, Beijing, China)
Yikun Zong (School of Computer Science, Peking University, Beijing, China)
Tong Yang (Peking University, Beijing, China) - Sketch, Network measurement, Bloom filter, IP lookup, Hash Table