HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Extreme low-bit (2–4 bit) quantization of large language models degrades sharply because of activation outliers and anisotropic weight curvature. This work proposes HARP, a learnable, structured, two-sided orthogonal processor that replaces the fixed randomized Hadamard transform (RHT) with an adaptive quantization basis while preserving exact full-precision equivalence. Each rotation is parameterized as a product of sparse butterfly-like block-orthogonal stages, with mixed-radix schedules to handle non-power-of-two dimensions and an initialization that matches the RHT up to a fixed permutation. Fitted only on calibration data, the rotation adapts to each layer, calibration distribution, and quantizer. On models from 1B to 70B parameters, HARP improves perplexity and zero-shot accuracy over fixed RHT and reaches 128 tokens per second at inference, about 2.1× the 61 tokens per second of FP16.
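The "full-precision equivalence" property can be illustrated with a minimal sketch. It assumes a single linear layer y = Wx and uses fixed normalized Hadamard matrices as stand-ins for the learned rotations. The helper names (`hadamard`, `matmul`, `matvec`, `transpose`) and the toy values are illustrative, not taken from the paper, and plain Python is used only to keep the example dependency-free.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        h = [r + r for r in h] + [r + [-v for v in r] for r in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in r] for r in h]

def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(a, v):
    return [sum(p * q for p, q in zip(row, v)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Toy layer (assumed values): 2 outputs, 4 inputs, one large activation outlier.
W = [[0.5, -1.0, 0.25, 2.0],
     [1.5, 0.0, -0.75, 0.3]]
x = [10.0, 0.1, -0.2, 0.1]

# Orthogonal rotations on the output side (U) and the input side (V).
U, V = hadamard(2), hadamard(4)

# Rotated weights and activations: these are what a quantizer would see.
W_rot = matmul(matmul(U, W), transpose(V))  # U W V^T
x_rot = matvec(V, x)                        # V x

# Undo the output-side rotation; without quantization this recovers W x.
y_ref = matvec(W, x)
y_rot = matvec(transpose(U), matvec(W_rot, x_rot))

print("max |y_ref - y_rot|:", max(abs(a - b) for a, b in zip(y_ref, y_rot)))
print("max |x|:", max(abs(v) for v in x), "max |V x|:", max(abs(v) for v in x_rot))
```

Because U and V are orthogonal, the two outputs agree up to floating-point error, while the rotated activation spreads the outlier across coordinates. A learned rotation would take the place of the fixed `hadamard` matrices in this sketch.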

📝 Abstract

Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantization remains highly sensitive to activation outliers and anisotropic weight curvature. Existing incoherence-based PTQ methods mitigate this issue with fixed randomized Hadamard transforms (RHTs), which improve quantization robustness but cannot adapt the rotated basis to the layer, calibration distribution, or quantizer. We introduce HARP (Hadamard-preconditioned Adaptive Rotation Processor), a learnable structured two-sided orthogonal processor that replaces fixed Hadamard mixing while preserving exact full-precision equivalence. HARP represents each rotation as a product of sparse butterfly-like block-orthogonal stages, supports non-power-of-two dimensions via mixed-radix schedules, and initializes to the RHT processor up to a fixed permutation. Fitted only on calibration data, HARP adapts the quantization basis to each layer and backend. Across 2–4 bit settings on models ranging from 1B to 70B parameters, HARP improves perplexity and zero-shot accuracy over fixed RHT. Importantly, HARP preserves deployment efficiency, reaching 128 tok/s versus 61 tok/s for FP16.
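One way to picture "a product of sparse butterfly-like block-orthogonal stages" is the sketch below. It assumes 2×2 blocks with one angle each and a power-of-two dimension; the paper's actual block structure, mixed-radix schedule, and random sign and permutation handling are not described in this abstract, so the functions `butterfly_stage`, `butterfly_rotate`, and `hadamard_init` are hypothetical. With every angle set to π/4 the product reduces to the normalized Hadamard transform, which mirrors the idea of starting from a Hadamard-like processor and then fitting the angles.

```python
import math

def butterfly_stage(x, stride, thetas):
    """One sparse orthogonal stage: mix each pair (i, i + stride) with a 2x2 block.

    Each block is [[cos t, sin t], [sin t, -cos t]], which is orthogonal for any t.
    """
    y = list(x)
    pair = 0
    for start in range(0, len(x), 2 * stride):
        for i in range(start, start + stride):
            c, s = math.cos(thetas[pair]), math.sin(thetas[pair])
            a, b = x[i], x[i + stride]
            y[i] = c * a + s * b
            y[i + stride] = s * a - c * b
            pair += 1
    return y

def butterfly_rotate(x, angles):
    """Apply log2(n) stages; angles[k] holds n/2 angles for stride 2**k."""
    for k, thetas in enumerate(angles):
        x = butterfly_stage(x, 1 << k, thetas)
    return x

def hadamard_init(n):
    """Angles that make the butterfly product equal the normalized Hadamard transform."""
    assert n > 1 and n & (n - 1) == 0, "this sketch assumes a power-of-two dimension"
    return [[math.pi / 4] * (n // 2) for _ in range(n.bit_length() - 1)]

def norm(v):
    return math.sqrt(sum(t * t for t in v))

x = [10.0, 0.1, -0.2, 0.1]

# At initialization the rotation matches the normalized Hadamard transform.
y_init = butterfly_rotate(x, hadamard_init(4))

# Any other angles (stand-ins for fitted values) still give an orthogonal map.
y_other = butterfly_rotate(x, [[0.3, 1.1], [2.0, -0.7]])

print("Hadamard-initialized output:", y_init)
print("norms:", norm(x), norm(y_init), norm(y_other))
```

The rotation costs O(n log n) per vector and has (n/2)·log2(n) parameters rather than the n(n−1)/2 of a dense orthogonal matrix. Orthogonality holds for every choice of angles, so in this sketch the angles could be fitted without any extra constraint.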

Problem

Research questions and friction points this paper is trying to address.

post-training quantization

activation outliers

anisotropic weight curvature

extreme low-bit quantization

adaptive rotation

Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive rotation

structured orthogonal processor

extreme quantization