Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe performance degradation of 1-bit post-training quantization (PTQ) for large language models (LLMs) deployed on resource-constrained devices, this work identifies the fundamental failure mechanism of output alignment under ultra-low-bit quantization. We propose the first data-aware activation error accumulation modeling framework tailored for 1-bit PTQ. Methodologically, it integrates calibration-data-driven error propagation analysis, output-layer sensitivity-weighted alignment, and lightweight gradient approximation for joint weight–output calibration. Extensive experiments on mainstream LLMs—including Llama-2/3 and Phi-3—demonstrate that our approach achieves an average task accuracy improvement of 15.2% over existing 1-bit PTQ methods, while incurring negligible calibration overhead. This represents the first systematic solution to activation error accumulation in 1-bit PTQ, enabling practical deployment of highly compressed LLMs without significant accuracy loss.

📝 Abstract
Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive size hinders deployment on resource-constrained devices. To reduce their computational and memory burden, various compression techniques have been proposed, including quantization, pruning, and knowledge distillation. Among these, post-training quantization (PTQ) is widely adopted for its efficiency: it requires no retraining and only a small calibration dataset, enabling low-cost deployment. Recent advances in post-training quantization have demonstrated that even sub-4-bit methods can preserve most of the original model's performance. However, 1-bit quantization, which converts floating-point weights to ±1, remains particularly challenging, as existing 1-bit PTQ methods often suffer significant performance degradation compared to the full-precision models. Specifically, most existing 1-bit PTQ approaches focus on weight alignment, aligning the full-precision model's weights with those of the quantized model, rather than directly aligning their outputs. Although the output-matching objective is more intuitive and better reflects the quantization goal, naively applying it to 1-bit LLMs often leads to notable performance degradation. In this paper, we investigate why, and under what conditions, output matching fails in the context of 1-bit LLM quantization. Based on our findings, we propose a novel data-aware PTQ approach for 1-bit LLMs that explicitly accounts for activation error accumulation while keeping optimization efficient. Empirical experiments demonstrate that our solution consistently outperforms existing 1-bit PTQ methods with minimal overhead.
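To make the abstract's setup concrete, the following is a minimal sketch of 1-bit weight quantization, not the paper's method: the classic scaled-sign binarizer, where the scale `alpha = mean(|W|)` minimizes the Frobenius error `||W - alpha * B||` over `B` in {-1, +1} and `alpha > 0`. The function name `binarize` is illustrative.

```python
import numpy as np

def binarize(W):
    """Quantize a weight matrix to {-1, +1} with a per-matrix scale.

    alpha = mean(|W|) is the closed-form optimal scale for the
    sign binarizer under the Frobenius-norm objective.
    """
    alpha = np.abs(W).mean()          # optimal scale for sign(W)
    B = np.where(W >= 0, 1.0, -1.0)   # 1-bit codes
    return alpha, B

W = np.array([[0.4, -1.2], [0.1, 0.7]])
alpha, B = binarize(W)
W_hat = alpha * B                     # dequantized ("fake-quant") weights
```

At 1 bit this projection discards almost all weight precision, which is why, as the abstract notes, PTQ methods must compensate via careful calibration rather than retraining.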
Problem

Research questions and friction points this paper is trying to address.

Addresses significant performance degradation in 1-bit post-training quantization of LLMs
Investigates failure conditions of output-matching approaches in 1-bit quantization
Proposes a data-aware method to manage activation error accumulation efficiently
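The distinction between the two alignment objectives discussed above can be sketched as follows. This is a toy contrast under an assumed scaled-sign binarizer, not the paper's algorithm: weight alignment compares parameters directly, while (data-aware) output alignment compares layer outputs on calibration activations `X`.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))           # full-precision layer weights (toy)
X = rng.normal(size=(8, 32))          # calibration activations (toy)

alpha = np.abs(W).mean()
B = np.where(W >= 0, 1.0, -1.0)       # 1-bit weights
W_hat = alpha * B

# Weight alignment: match parameters directly, independent of the data.
loss_weight = np.linalg.norm(W - W_hat) ** 2

# Output alignment (data-aware): match the layer's outputs on
# calibration inputs, which is closer to the actual quantization goal.
loss_output = np.linalg.norm(W @ X - W_hat @ X) ** 2
```

The output loss weights weight errors by how the calibration data actually excites them, which is intuitively the right objective; the paper's point is that applying it naively at 1 bit still fails once errors accumulate across layers.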
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-aware PTQ approach for 1-bit LLMs
Explicitly accounts for activation error accumulation
Keeps optimization efficient with minimal overhead
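The activation error accumulation that the method targets can be illustrated with a toy deep network (an illustrative sketch, not the paper's model): the quantized path is fed its own already-erroneous activations at every layer, so per-layer errors compound with depth.

```python
import numpy as np

rng = np.random.default_rng(1)
depth = 6
layers = [rng.normal(size=(16, 16)) / 4 for _ in range(depth)]
x_fp = rng.normal(size=(16, 8))       # toy input activations
x_q = x_fp.copy()

rel_errs = []
for W in layers:
    alpha = np.abs(W).mean()
    W_hat = alpha * np.where(W >= 0, 1.0, -1.0)   # 1-bit layer
    x_fp = np.tanh(W @ x_fp)          # full-precision path
    x_q = np.tanh(W_hat @ x_q)        # quantized path, fed its own erroneous inputs
    rel_errs.append(np.linalg.norm(x_fp - x_q) / np.linalg.norm(x_fp))
```

Tracking `rel_errs` layer by layer shows how quantization error typically compounds with depth; per-layer calibration that ignores this propagation optimizes against inputs the quantized model never actually sees.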