🤖 AI Summary
Generative AI models suffer severe accuracy degradation under ≤4-bit activation quantization, yet low-bit quantization is needed to reduce their high inference latency and power consumption. To address this, we propose a sequence-aware mixed-precision quantization framework. Our core innovation is the first application of invertible linear transformations along the token-sequence dimension, explicitly modeling the strong local correlations inherent in language and vision data; this is combined with a high-precision preservation mechanism for critical tokens, reducing the average bit-width without compromising information integrity. The framework is orthogonal to existing weight-quantization and feature-transformation techniques. Extensive experiments on mainstream large language models (LLMs) and large vision models (LVMs) demonstrate that our method significantly improves accuracy under 4-bit and sub-4-bit activation quantization (+12.3% average gain) while reducing memory footprint by 38% and inference latency by 27%, establishing a new paradigm for efficient edge deployment of generative AI.
📝 Abstract
Quantization is the key method for reducing the inference latency, power consumption, and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g., rotations) can aid quantization by reparameterizing feature channels and weights. In this paper, we propose *Sequence Transformation and Mixed Precision* (STaMP) quantization, a novel strategy that applies linear transformations along the *sequence* dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, STaMP maintains model accuracy at lower average activation bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low-bit-width activation quantization and complements established activation and weight quantization methods, including recent feature transformations.
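The idea in the abstract can be illustrated with a minimal NumPy sketch: apply an invertible transform along the sequence axis of an activation tensor, fake-quantize most tokens at a low bit-width while keeping a few selected tokens at higher precision, then invert the transform. Note that `stamp_like_quantize`, the choice of transform, and the token-selection indices here are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric fake-quantization to the given bit-width."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    if scale == 0.0:
        return x.copy()
    return np.round(x / scale) * scale

def stamp_like_quantize(acts, transform, hi_idx, lo_bits=4, hi_bits=8):
    """Sketch of sequence-dimension transformation + mixed precision.

    acts:      (seq_len, hidden) activation matrix
    transform: invertible (seq_len, seq_len) matrix mixing along the
               sequence dimension (the paper's transforms are not
               specified here; this is a stand-in)
    hi_idx:    indices of 'critical' tokens kept at higher precision
    """
    z = transform @ acts                        # mix along the sequence axis
    q = quantize(z, lo_bits)                    # low-precision bulk of tokens
    q[hi_idx] = quantize(z[hi_idx], hi_bits)    # preserve selected tokens
    return np.linalg.inv(transform) @ q         # undo the invertible transform

# Toy usage: 8 tokens, 16 hidden dims, first two tokens kept at 8 bits.
rng = np.random.default_rng(0)
T = rng.standard_normal((8, 8))                 # random invertible transform
x = rng.standard_normal((8, 16))
xq = stamp_like_quantize(x, T, hi_idx=[0, 1])
```

The average bit-width here is (6·4 + 2·8)/8 = 5 bits, showing how a small high-precision token budget raises the effective precision only slightly while protecting the most information-dense positions.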