QTIP: Quantization with Trellises and Incoherence Processing

πŸ“… 2024-06-17
πŸ›οΈ Neural Information Processing Systems
πŸ“ˆ Citations: 14
✨ Influential: 4
πŸ“„ PDF
πŸ€– AI Summary
Post-training quantization (PTQ) of large language models (LLMs) faces an accuracy bottleneck in vector quantization (VQ): codebook size grows exponentially with dimension, which limits practical VQ-based PTQ to low dimensions (≤ 8). Method: This work introduces trellis-coded quantization (TCQ) to LLM weight PTQ. TCQ's stateful decoder decouples the codebook size from the bitrate and effective dimension. QTIP combines a hardware-efficient "bitshift" trellis structure, incoherence processing of the weights, and a spectrum of codes ranging from lookup-only to computed lookup-free. Contribution/Results: The method enables efficient quantization in ultra-high dimensions (far beyond 8D), significantly improving quantization fidelity at low bitrates while achieving state-of-the-art inference throughput with strong hardware efficiency.
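The "bitshift" trellis idea above can be sketched as follows. This is a toy decoder, not the paper's implementation: the state is a sliding L-bit window over the compressed bitstream, so each step shifts in k fresh bits and no explicit transition table is needed. The per-state value table here is a random stand-in for the paper's lookup or computed codes.

```python
import numpy as np

def bitshift_decode(bitstream_bits, L=16, k=2, seed=0):
    """Toy bitshift-trellis decoder.

    The L-bit state is a sliding window over the bitstream; each step
    shifts in k new bits, so consecutive states overlap in L - k bits
    and transitions are implicit (no stored trellis graph).
    """
    rng = np.random.default_rng(seed)
    # Stand-in for the real code: one fixed pseudorandom value per state.
    state_to_value = rng.standard_normal(2 ** L).astype(np.float32)
    mask = (1 << L) - 1
    state = 0
    out = []
    for i in range(0, len(bitstream_bits), k):
        for b in bitstream_bits[i:i + k]:
            state = ((state << 1) | b) & mask
        out.append(state_to_value[state])
    return np.array(out, dtype=np.float32)
```

Note how the rate is k bits per decoded weight regardless of L, which is what separates the state space (and hence effective dimension) from the bitrate.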

πŸ“ Abstract
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ works to low VQ dimensions ($\le 8$) that in turn limit quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.
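The abstract's memory-bound argument can be made concrete with a back-of-the-envelope throughput bound (illustrative numbers only, not from the paper): when every generated token must read all weights from memory, throughput is capped by bandwidth divided by weight bytes.

```python
def tokens_per_sec(params_billion, bits_per_weight, mem_bw_gbps):
    """Rough decode-throughput ceiling for a memory-bound LLM:
    each token reads all weights once, so
    throughput ~ memory bandwidth / total weight bytes."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return mem_bw_gbps * 1e9 / weight_bytes

# Hypothetical 7B-parameter model on a GPU with 1 TB/s memory bandwidth:
fp16 = tokens_per_sec(7, 16, 1000)  # ~71 tokens/s
w2 = tokens_per_sec(7, 2, 1000)     # ~571 tokens/s, ~8x headroom
```

This is why lowering bits per weight translates fairly directly into inference speed, provided decoding itself stays cheap, which is the point of the hardware-efficient bitshift trellis.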
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM memory footprint via post-training quantization
Overcoming limitations of low-dimensional vector quantization
Enhancing quantization quality and speed with trellis codes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses trellis coded quantization (TCQ)
Enables ultra-high-dimensional quantization
Hardware-efficient bitshift trellis structure
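Encoding against a trellis is a shortest-path problem, so the standard tool is the Viterbi algorithm. The sketch below (a toy, with a random per-state value table standing in for the paper's codes; all names are illustrative) finds, for a 1-bit-per-weight bitshift trellis with L-bit states, the state sequence whose decoded values best match the weights in squared error.

```python
import numpy as np

def viterbi_encode(w, L=8, seed=0):
    """Toy Viterbi encoder over a bitshift trellis (1 bit per weight).

    Finds the state path minimizing sum((w[t] - values[state_t])^2),
    where each transition shifts one new bit into an L-bit state.
    """
    rng = np.random.default_rng(seed)
    values = rng.standard_normal(2 ** L)  # toy per-state reconstruction values
    S = 2 ** L
    cost = np.zeros(S)                    # best cost ending in each state
    back = np.zeros((len(w), S), dtype=np.int64)
    for t, wt in enumerate(w):
        new_cost = np.empty(S)
        for s in range(S):
            # Predecessors of s under "shift left, append low bit":
            # p's low L-1 bits must equal s's high L-1 bits.
            p0 = s >> 1
            p1 = (s >> 1) | (1 << (L - 1))
            p = p0 if cost[p0] <= cost[p1] else p1
            new_cost[s] = cost[p] + (wt - values[s]) ** 2
            back[t, s] = p
        cost = new_cost
    s = int(np.argmin(cost))              # best final state, then trace back
    states = [s]
    for t in range(len(w) - 1, 0, -1):
        s = int(back[t, s])
        states.append(s)
    states.reverse()
    return states, values[np.array(states)]
```

Because the state space (2^L) is independent of the bitrate, the same dynamic program scales to the very long effective dimensions the paper targets, at O(T · 2^L) encoding cost per row of T weights.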
πŸ”Ž Similar Papers
No similar papers found.