🤖 AI Summary
This work addresses the severe degradation of high-dimensional feature manifolds under ultra-low bit-width Power-of-Two (PoT) quantization, which is caused by insufficient angular resolution and hinders Transformer deployment on edge devices. The authors propose Orthogonal Residual Projection (ORP), a framework that recasts PoT quantization as a dual-basis orthogonal projection, which they present as the first geometric-projection treatment of PoT quantization. ORP constructs a higher-resolution residual lattice using only bit-shifts and additions, raising angular resolution at low bit-widths, and replaces gradient-based optimization with an analytical calibration solver that calibrates LLaMA-2-7B in approximately 15 minutes. Under the W3/A16 setting, ORP attains a perplexity of 6.10 on LLaMA-2-7B, and standard-cell RTL synthesis at a 28nm node indicates that it mitigates the timing bottlenecks of dense multiplier trees, supporting efficient and accurate 3-bit and 4-bit inference.
📝 Abstract
The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limitations and the critical timing bottlenecks introduced by dense Multiply-Accumulate (MAC) arrays. In the ultra-low bit regime, logarithmic Power-of-Two (PoT) quantization provides a hardware-efficient alternative by replacing MAC operations with bit-shifts. However, the non-uniform exponential lattice is inherently limited by a Low Angular Resolution Regime, a structural flaw that becomes particularly pronounced at sub-4-bit thresholds, leading to a notable degradation of high-dimensional feature manifolds.
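To make the PoT idea concrete, the following is a minimal sketch of plain single-term PoT quantization, not the paper's method. The function names (`pot_quantize`, `shift_mul`, `cosine`) and the exponent range are assumptions chosen for illustration. It rounds each weight to a signed power of two in the log domain, multiplies an integer activation by that weight with a shift, and reports the cosine similarity between the original and quantized weight vectors as a rough proxy for angular error.

```python
import math

def pot_quantize(w, min_exp=-3, max_exp=0):
    """Map w to (sign, k) so that w is approximated by sign * 2**k.

    Rounding is done in the log domain and k is clamped to an
    illustrative exponent range. sign is 0 when w == 0.
    """
    if w == 0:
        return 0, min_exp
    sign = 1 if w > 0 else -1
    k = round(math.log2(abs(w)))
    k = max(min_exp, min(max_exp, k))
    return sign, k

def shift_mul(x_int, sign, k):
    """Multiply an integer activation by sign * 2**k using only a shift.

    A negative k becomes a right shift, which floors the result.
    """
    shifted = x_int << k if k >= 0 else x_int >> -k
    return sign * shifted

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

weights = [0.9, 0.3, -0.6, 0.11]
quantized = [s * 2.0 ** k for s, k in map(pot_quantize, weights)]
print(quantized, cosine(weights, quantized))
```

Because the representable magnitudes are spaced exponentially, the gaps between neighbouring levels are widest near the largest weights, which is one way to picture the coarse angular resolution the abstract describes.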
To address this geometric limitation, we propose Orthogonal Residual Projection (ORP), an algorithm-hardware co-design framework. By formulating quantization as a dual-basis geometric projection, ORP adaptively synthesizes a higher-resolution residual lattice using strictly shift-and-add operations. Furthermore, ORP's analytical solver offers a practical alternative to computationally intensive gradient-based optimization, reducing the full-model calibration time for LLaMA-2-7B to approximately 15 minutes.
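The abstract does not spell out ORP's projection or solver, so the sketch below should be read only as a generic illustration of the two ingredients it names: a residual lattice built from shift-and-add terms, and a closed-form calibration step. It assumes a greedy two-term PoT decomposition (quantize the weight, then quantize the leftover residual) and an ordinary least-squares scale. The helper names `pot_value`, `two_term_pot`, and `lsq_scale` and the exponent range are hypothetical.

```python
import math

def pot_value(w, min_exp=-6, max_exp=0):
    """Nearest signed power of two to w in the log domain, or 0.0.

    Values smaller than half of the smallest level are mapped to 0.0
    so that a tiny residual is never rounded up to a larger term.
    """
    if abs(w) < 2.0 ** (min_exp - 1):
        return 0.0
    k = round(math.log2(abs(w)))
    k = max(min_exp, min(max_exp, k))
    return math.copysign(2.0 ** k, w)

def two_term_pot(w, min_exp=-6, max_exp=0):
    """Greedy two-term approximation: w ~ first + second.

    Each term is a signed power of two (or zero), so multiplying by
    the sum needs two shifts and one addition.
    """
    first = pot_value(w, min_exp, max_exp)
    second = pot_value(w - first, min_exp, max_exp)
    return first, second

def lsq_scale(weights, approx):
    """Closed-form scale minimising sum((w - scale * q)**2).

    This is the orthogonal projection of weights onto approx,
    computed without any gradient steps.
    """
    num = sum(w * q for w, q in zip(weights, approx))
    den = sum(q * q for q in approx)
    return num / den if den else 0.0

weights = [0.75, 0.3, -0.6, 0.11]
single = [pot_value(w) for w in weights]
double = [sum(two_term_pot(w)) for w in weights]
for w, s, d in zip(weights, single, double):
    print(f"{w:+.3f}  one term: {s:+.4f}  two terms: {d:+.4f}")
print("least-squares scale:", lsq_scale(weights, double))
```

In this toy setting the second term can only keep or reduce the per-weight error of the single-term approximation, which is the intuition behind using a residual lattice to recover resolution.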
Extensive evaluations demonstrate ORP's applicability across modalities and its hardware efficiency. Under the 3-bit (W3/A16) constraint, ORP achieves a perplexity of 6.10 on LLaMA-2-7B, comparing favorably to conventional MAC-intensive baselines like AWQ without relying on asymmetric scaling, while maintaining competitive accuracy in 4-bit scenarios. At the silicon level, standard-cell RTL synthesis at a 28nm node indicates that ORP effectively mitigates the timing bottlenecks associated with dense multiplier trees.