ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the accuracy degradation that large language models suffer under low-bit activation quantization, where activation outliers induce large quantization error and hinder efficient deployment. The authors propose a lightweight post-training rotation calibration method that formulates the alignment of normalized activations with the corners of an inscribed hypercube as an orthogonal Procrustes problem, which yields gradient-free, closed-form rotation updates. An online calibration procedure updates the rotations as calibration samples are processed, so activations never need to be stored on disk, and the resulting rotations encourage activation energy to spread more evenly across dimensions. Evaluated on Llama-2 and Llama-3 models (3B–70B), the method achieves competitive or improved perplexity and commonsense reasoning performance relative to existing techniques, without end-to-end rotation training or large offline activation storage.
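To make the outlier problem concrete, the following minimal sketch compares 4-bit quantization error on synthetic activations with and without an orthogonal rotation. Everything here is an illustrative assumption rather than the paper's setup: the data is random with one artificially scaled channel, the quantizer is plain symmetric round-to-nearest with one scale per row, and the rotation is a fixed normalized Hadamard matrix (a common choice in rotation-based quantization work), not a calibrated rotation.

```python
import numpy as np

def hadamard(d):
    """Normalized Sylvester-Hadamard matrix (orthogonal); d must be a power of 2."""
    assert d > 0 and d & (d - 1) == 0, "d must be a power of 2"
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H / np.sqrt(d)

def quantize_per_row(x, bits=4):
    """Symmetric round-to-nearest quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def relative_error(x, x_hat):
    return np.linalg.norm(x - x_hat) / np.linalg.norm(x)

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 64))
x[:, 3] *= 50.0  # synthetic outlier channel (assumption, for illustration only)

H = hadamard(64)
err_plain = relative_error(x, quantize_per_row(x))
# Quantize in the rotated basis, then rotate back to compare in the original basis.
err_rotated = relative_error(x, quantize_per_row(x @ H) @ H.T)

print(f"relative error without rotation: {err_plain:.4f}")
print(f"relative error with rotation:    {err_rotated:.4f}")
```

In this toy setting the outlier channel inflates each row's quantization scale, so most other entries round to zero; after the rotation the outlier's magnitude is shared across all 64 dimensions and the scale shrinks accordingly.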
📝 Abstract
Large language models (LLMs) are costly to deploy due to their large memory footprint and high inference cost. Weight-activation quantization can reduce these costs, but low-bit activation quantization remains difficult because activation outliers induce large quantization error. Recent rotation-based methods address this by applying orthogonal transformations that redistribute activation magnitude across dimensions, but existing approaches either require expensive end-to-end rotation training or rely on stored activation corpora, introducing significant compute or storage overhead. We propose a lightweight post-training rotation calibration method for LLM activation quantization. Our method learns orthogonal rotations that align normalized activations with the corners of an inscribed hypercube, encouraging activation energy to be distributed more evenly across dimensions. This objective admits an efficient closed-form update via the orthogonal Procrustes problem, avoiding gradient-based optimization over the orthogonal group. We further introduce an online calibration procedure that updates rotations as calibration samples are processed, eliminating the need to store activations on disk and allowing rotations to adapt to quantized activation distributions during calibration. Experiments on Llama-2 and Llama-3 models from 3B to 70B parameters show that our method achieves competitive or improved performance across perplexity benchmarks and common sense reasoning tasks while avoiding both costly end-to-end training and large offline activation storage.
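The closed-form update mentioned in the abstract can be sketched as an alternating scheme. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random orthogonal initialization, the fixed iteration count, and the synthetic data are all hypothetical, and the sketch runs on an in-memory batch rather than reproducing the paper's online procedure. Each step picks the nearest hypercube corner for every rotated, unit-normalized activation, then solves the orthogonal Procrustes problem $\min_{R^\top R = I} \lVert XR - T \rVert_F$ via an SVD of $X^\top T$.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def corner_targets(Y):
    """Nearest corner of the hypercube inscribed in the unit sphere, per row."""
    d = Y.shape[1]
    return np.where(Y >= 0, 1.0, -1.0) / np.sqrt(d)

def procrustes_rotation(X, T):
    """Orthogonal R minimizing ||X @ R - T||_F, in closed form via SVD."""
    U, _, Vt = np.linalg.svd(X.T @ T)
    return U @ Vt

def calibrate_rotation(X, n_iters=10, seed=0):
    """Alternate between corner assignment and the Procrustes update."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize activations
    R = random_orthogonal(X.shape[1], rng)            # assumed initialization
    T = corner_targets(X @ R)
    losses = [np.sum((X @ R - T) ** 2)]
    for _ in range(n_iters):
        R = procrustes_rotation(X, T)   # best rotation for the current targets
        T = corner_targets(X @ R)       # best corners for the current rotation
        losses.append(np.sum((X @ R - T) ** 2))
    return R, losses

rng = np.random.default_rng(1)
acts = rng.standard_normal((256, 16))
acts[:, 0] *= 20.0  # synthetic outlier channel (assumption, for illustration only)

R, losses = calibrate_rotation(acts)
unit = acts / np.linalg.norm(acts, axis=1, keepdims=True)
peak_before = np.abs(unit).max(axis=1).mean()
peak_after = np.abs(unit @ R).max(axis=1).mean()
print(f"mean per-row peak magnitude: {peak_before:.3f} -> {peak_after:.3f}")
```

Because each half-step minimizes the same objective over one variable with the other held fixed, the recorded loss cannot increase from one iteration to the next. The scheme can still settle in a local optimum, which is why the sketch starts from a random rotation rather than the identity.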
Problem

Research questions and friction points this paper is trying to address.

activation quantization
outliers
orthogonal rotation
post-training calibration
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

activation quantization
orthogonal rotation
post-training calibration
online calibration
LLM compression