Tequila: Trapping-free Ternary Quantization for Large Language Models

📅 2025-09-28
🤖 AI Summary
To address the problem of weight trapping in dead zones during ternary quantization of large language models (LLMs), which causes gradient vanishing and significant accuracy degradation, this paper proposes a trapping-free ternary quantization method with values {−1, 0, 1}. The core method dynamically identifies weights stuck in dead zones and reparameterizes them as learnable, dynamic biases, enabling continuous signal propagation in the forward pass and effective gradient backpropagation, thereby enhancing model capacity and optimization stability. Crucially, the approach eliminates reliance on mixed-precision multiplication, requiring only hardware-friendly addition operations. Evaluated across five standard benchmarks, it consistently outperforms existing ternary methods: on ARC, accuracy improves by more than 4%, the gap to full-precision performance narrows to under 1%, and inference achieves a 3.0× speedup.

๐Ÿ“ Abstract
Quantization techniques are essential for deploying Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making them impractical in such settings. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training on massive data. We identify the core issue as deadzone trapping: a large number of weights become trapped at the deadzone boundary. These weights receive only noisy, uninformative gradients, which prevents them from escaping the deadzone stably and severely impedes model capacity and optimization. To address this issue, we propose Tequila, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. The repurposed weights provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly zero inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves a >4% accuracy gain over the SOTA baseline, nearly matching full-precision performance (within a <1% gap) with a 3.0x inference speedup. Consequently, Tequila offers a highly practical and efficient implementation for deploying advanced LLMs in resource-constrained environments. The code is available at https://github.com/Tencent/AngelSlim.
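The dead zone described in the abstract can be made concrete with a minimal sketch (NumPy; the threshold `delta` and the example weight values are illustrative, not taken from the paper):

```python
import numpy as np

def ternary_quantize(w, delta):
    """Ternary quantizer: weights inside the dead zone [-delta, delta]
    map to 0; weights outside map to -1 or +1 by sign."""
    return np.where(np.abs(w) > delta, np.sign(w), 0.0)

w = np.array([-0.9, -0.05, 0.02, 0.29, 0.7])
q = ternary_quantize(w, delta=0.3)
# -0.9 -> -1; -0.05, 0.02, 0.29 -> 0 (dead zone); 0.7 -> +1
```

A weight such as 0.29 sits just inside the boundary: it contributes nothing in the forward pass and, under a straight-through estimator, receives only noisy gradient signal. This is the "deadzone trapping" the paper targets.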
Problem

Research questions and friction points this paper is trying to address.

Addresses accuracy loss in ternary quantization of large language models
Solves deadzone trapping issue during weight quantization process
Enables efficient LLM deployment on resource-constrained edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces deadzone-trapped weights with dynamic biases
Enables direct gradient signals during backpropagation
Achieves near full-precision accuracy with ternary quantization
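One way to read the bias reparameterization above is sketched below as a hypothetical NumPy linear layer. The trapped-weight criterion, the `trap_eps` margin, and the per-output-row bias are assumptions made for illustration; they are not the authors' implementation.

```python
import numpy as np

def ternary(w, delta):
    """Ternary quantizer with dead zone [-delta, delta]."""
    return np.where(np.abs(w) > delta, np.sign(w), 0.0)

def tequila_linear(W, x, delta, trap_eps=0.05):
    """Hypothetical sketch: weights stuck just inside the dead-zone
    boundary are pulled out of the ternary kernel and repurposed as a
    per-output bias, so they keep contributing a continuous signal in
    the forward pass (and would receive direct gradients in training)."""
    # Treat weights within trap_eps of the boundary, from inside, as trapped.
    trapped = (np.abs(W) <= delta) & (np.abs(W) >= delta - trap_eps)
    Q = ternary(np.where(trapped, 0.0, W), delta)  # trapped weights leave the kernel
    b = np.where(trapped, W, 0.0).sum(axis=1)      # ...and become a dynamic bias
    return Q @ x + b                               # add-only matmul plus a vector add
```

At inference the matmul with `Q` needs only additions and subtractions, and the extra bias is a single vector add, which is consistent with the paper's claim of nearly zero inference overhead.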
👥 Authors
Hong Huang (City University of Hong Kong)
Decheng Wu (Tencent)
Rui Cen (Tencent)
Guanghua Yu (Tencent)
Zonghang Li (MBZUAI)
Kai Liu (Tencent)
Jianchen Zhu (Tencent)
Peng Chen (Tencent)
Xue Liu (McGill University)
Dapeng Wu (Chongqing University of Posts and Telecommunications)