AI Summary
Extreme low-bit (2–4 bit) quantization of large language models degrades performance significantly because of activation outliers and anisotropic weight curvature. To address this, the work proposes HARP, a learnable, structured, two-sided orthogonal processor that replaces the fixed Hadamard transform with an adaptive quantization basis while preserving exact full-precision equivalence. HARP introduces, for the first time, an adaptive rotation mechanism built on Hadamard preconditioning; it supports arbitrary dimensions and adapts to each layer, calibration distribution, and quantizer. Each rotation matrix is parameterized efficiently as a product of sparse butterfly-like block-orthogonal stages, combined with mixed-radix schedules and Hadamard initialization, so it can be fitted using calibration data alone. Experiments on models ranging from 1B to 70B parameters show that HARP improves perplexity and zero-shot accuracy over fixed randomized Hadamard transforms, while reaching an inference speed of 128 tokens per second, more than double the 61 tokens per second of FP16.
Abstract
Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantization remains highly sensitive to activation outliers and anisotropic weight curvature. Existing incoherence-based PTQ methods mitigate this issue with fixed randomized Hadamard transforms (RHTs), which improve quantization robustness but cannot adapt the rotated basis to the layer, calibration distribution, or quantizer. We introduce HARP (Hadamard-preconditioned Adaptive Rotation Processor), a learnable structured two-sided orthogonal processor that replaces fixed Hadamard mixing while preserving exact full-precision equivalence. HARP represents each rotation as a product of sparse butterfly-like block-orthogonal stages, supports non-power-of-two dimensions via mixed-radix schedules, and initializes to the RHT processor up to a fixed permutation. Fitted only on calibration data, HARP adapts the quantization basis to each layer and backend. Across 2–4 bit settings on models ranging from 1B to 70B parameters, HARP improves perplexity and zero-shot accuracy over fixed RHT. Importantly, HARP preserves deployment efficiency, reaching 128 tok/s versus 61 tok/s for FP16.
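The two structural claims above (a rotation built from sparse butterfly stages that starts at a Hadamard transform, and a two-sided rotation that leaves the full-precision output unchanged) can be illustrated with a minimal sketch. This is not the authors' implementation: the 2x2 block form `[[c, s], [s, -c]]`, the power-of-two size, and every function name below are assumptions chosen for illustration, and the sketch omits mixed-radix schedules, random sign flips, calibration fitting, and the quantizer itself.

```python
import math
import random

def apply_stage(v, stride, angles):
    """Mix each pair (i, i + stride) with the 2x2 orthogonal block [[c, s], [s, -c]]."""
    out = list(v)
    k = 0
    for start in range(0, len(v), 2 * stride):
        for i in range(start, start + stride):
            c, s = math.cos(angles[k]), math.sin(angles[k])
            a, b = v[i], v[i + stride]
            out[i] = c * a + s * b
            out[i + stride] = s * a - c * b
            k += 1
    return out

def init_params(n):
    """One angle per pair per stage (assumed layout); pi/4 gives normalized 2x2 Hadamard blocks."""
    assert n > 1 and n & (n - 1) == 0, "this sketch handles powers of two only"
    return [[math.pi / 4] * (n // 2) for _ in range(n.bit_length() - 1)]

def rotate(v, params):
    """Apply all butterfly stages: log2(n) sparse stages instead of one dense n x n matrix."""
    for level, angles in enumerate(params):
        v = apply_stage(v, 1 << level, angles)
    return v

def rotate_inverse(v, params):
    """Each block is a symmetric reflection, so every stage is its own inverse."""
    for level in reversed(range(len(params))):
        v = apply_stage(v, 1 << level, params[level])
    return v

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rotate_weight(W, p_params, q_params):
    """Return W' = P W Q^T, where P and Q are the rotations defined by the two parameter sets."""
    rows = [rotate(row, q_params) for row in W]                 # W Q^T, one row at a time
    cols = [rotate(list(col), p_params) for col in zip(*rows)]  # P applied to each column
    return [list(r) for r in zip(*cols)]

# Hadamard initialization: the first basis vector maps to a constant vector 1/sqrt(n).
n = 8
h = rotate([1.0] + [0.0] * (n - 1), init_params(n))

# Two-sided equivalence with arbitrary (here random, standing in for learned) angles.
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(n)]
P = [[random.uniform(0, math.pi) for _ in range(n // 2)] for _ in range(3)]
Q = [[random.uniform(0, math.pi) for _ in range(n // 2)] for _ in range(3)]

y_ref = matvec(W, x)
y_rot = rotate_inverse(matvec(rotate_weight(W, P, Q), rotate(x, Q)), P)  # P^T (W' (Q x))
max_err = max(abs(a - b) for a, b in zip(y_ref, y_rot))
```

In a quantized pipeline, the quantizer would presumably act on the rotated weight `W'` (and, where applicable, on the rotated activation `Q x`), so the choice of angles changes what the quantizer sees without changing the full-precision result. Because every block stays orthogonal for any angle, the equivalence holds throughout fitting and not only at initialization.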