OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the degradation of high-dimensional feature manifolds under ultra-low bit-width Power-of-Two (PoT) quantization, which the authors attribute to insufficient angular resolution and which hinders Transformer deployment on edge devices. To overcome this limitation, the authors propose Orthogonal Residual Projection (ORP), a framework that introduces a geometric projection perspective into PoT quantization for the first time, modeling it as a dual-basis orthogonal projection. The method constructs a higher-resolution residual lattice using only bit-shifts and additions, and replaces gradient-based optimization with an analytical calibration solver, reducing full-model calibration of LLaMA-2-7B to approximately 15 minutes. Under the W3/A16 setting, it attains a perplexity of 6.10 on LLaMA-2-7B while remaining competitive in 4-bit scenarios, and standard-cell RTL synthesis at a 28 nm node indicates that it mitigates the timing bottlenecks of dense multiplier trees.

📝 Abstract

The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limitations and the critical timing bottlenecks introduced by dense Multiply-Accumulate (MAC) arrays. In the ultra-low bit regime, logarithmic Power-of-Two (PoT) quantization provides a hardware-efficient alternative by replacing MAC operations with bit-shifts. However, the non-uniform exponential lattice is inherently limited by a Low Angular Resolution Regime, a structural flaw that becomes particularly pronounced at sub-4-bit thresholds, leading to a notable degradation of high-dimensional feature manifolds. To address this geometric limitation, we propose Orthogonal Residual Projection (ORP), an algorithm-hardware co-design framework. By formulating quantization as a dual-basis geometric projection, ORP adaptively synthesizes a higher-resolution residual lattice using strictly shift-and-add operations. Furthermore, ORP's analytical solver offers a practical alternative to computationally intensive gradient-based optimization, reducing the full-model calibration time for LLaMA-2-7B to approximately 15 minutes. Extensive evaluations demonstrate ORP's applicability across modalities and its hardware efficiency. Under the 3-bit (W3/A16) constraint, ORP achieves a perplexity of 6.10 on LLaMA-2-7B, comparing favorably to conventional MAC-intensive baselines like AWQ without relying on asymmetric scaling, while maintaining competitive accuracy in 4-bit scenarios. At the silicon level, standard-cell RTL synthesis at a 28 nm node indicates that ORP effectively mitigates the timing bottlenecks associated with dense multiplier trees.
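
The abstract does not spell out ORP's dual-basis construction or its analytical solver, so the following is only a generic illustration of the underlying idea, not the authors' method. It rounds each weight to a single signed power of two, then adds a second power-of-two term fitted greedily to the leftover residual, and compares how well each version preserves the direction of the original weight vector. The function names (`pot_round`, `residual_pot`, `cosine`), the exponent range, and the sample weights are all hypothetical choices made for this sketch.

```python
import math

def pot_round(x, min_exp=-8, max_exp=0):
    """Round x to a signed power of two, sign * 2**e, rounding e in the log domain.

    The exponent range is an assumption for this sketch; zero maps to zero.
    """
    if x == 0:
        return 0.0
    e = round(math.log2(abs(x)))
    e = max(min_exp, min(max_exp, e))
    return math.copysign(2.0 ** e, x)

def residual_pot(x, min_exp=-8, max_exp=0):
    """Two-term approximation: a PoT term plus a PoT term fitted to the residual."""
    first = pot_round(x, min_exp, max_exp)
    second = pot_round(x - first, min_exp, max_exp)
    return first + second

def cosine(a, b):
    """Cosine similarity, used here as a simple proxy for angular fidelity."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

w = [0.75, -0.3, 0.11, 0.5]              # made-up weights
single = [pot_round(v) for v in w]       # one PoT term per weight
double = [residual_pot(v) for v in w]    # PoT term plus PoT residual term

print("single-term cosine:", cosine(w, single))
print("two-term cosine:   ", cosine(w, double))

# With two PoT terms, a multiply becomes two shifts and one add.
# Example: 0.3125 = 2**-2 + 2**-4, applied to an integer activation.
x = 64
print((x >> 2) + (x >> 4), x * 0.3125)
```

On these sample weights the two-term version stays closer in angle to the original vector than the single-term version, which is the kind of resolution gain the abstract attributes to a residual lattice. How ORP actually selects its two bases and calibrates them is not described here.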

Problem

Research questions and friction points this paper is trying to address.

Power-of-Two quantization

Low Angular Resolution Regime

ultra-low bit quantization

Large Language Models

Vision Transformers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Orthogonal Residual Projection

Power-of-Two Quantization

Multiplier-Free