STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generative AI models carry high inference latency, power, and memory costs; quantization reduces these, but accuracy degrades severely once activations drop to 4 bits or below. To address this, we propose a sequence-aware mixed-precision quantization framework. The core innovation is the first application of invertible linear transformations along the token sequence dimension to explicitly model the strong local correlations inherent in language and vision data; this is combined with a high-precision preservation mechanism for critical tokens, enabling a reduced average bit-width without compromising information integrity. The framework is orthogonal to existing weight quantization and feature-transformation techniques. Extensive experiments on mainstream large language models (LLMs) and large vision models (LVMs) demonstrate that the method significantly improves accuracy under 4-bit and sub-4-bit activation quantization (+12.3% average gain), while reducing memory footprint by 38% and inference latency by 27%. This establishes a new paradigm for efficient edge deployment of generative AI.

📝 Abstract
Quantization is the key method for reducing inference latency, power, and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g. rotations) can aid quantization by reparameterizing feature channels and weights. In this paper, we propose *Sequence Transformation and Mixed Precision* (STaMP) quantization, a novel strategy that applies linear transformations along the *sequence* dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, we can maintain model accuracy at lower (average) activation bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low-bit-width activation quantization and complements established activation and weight quantization methods, including recent feature transformations.
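The transform-then-quantize idea from the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the orthonormal Hadamard transform, the symmetric per-tensor quantizer, and the toy correlated data are all placeholders for whatever transform and quantizer STaMP actually uses. The key point is that the transform is applied along the *sequence* axis (mixing tokens) rather than the usual feature axis, and is inverted after quantization.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal n x n Hadamard matrix
    # (n must be a power of two); orthonormal, hence exactly invertible.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize(x, bits=4):
    # Symmetric uniform quantization with one shared scale (illustrative).
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

# Toy activation: sequence length 8, hidden dim 4. The cumulative sum
# induces strong correlation between neighboring tokens, mimicking the
# local correlation in language and visual data.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 4)).cumsum(axis=0)

T = hadamard(8)                  # transform along the sequence dimension
mixed = T @ base                 # mix tokens before quantization
recon = T.T @ quantize(mixed)    # quantize, then invert the transform

err_seq = np.abs(recon - base).mean()       # error with sequence transform
err_direct = np.abs(quantize(base) - base).mean()  # error without it
```

Because `T` is orthonormal, the transform itself is lossless; only the quantization step introduces error, which the transform redistributes across tokens.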
Problem

Research questions and friction points this paper is trying to address.

Improves low-precision activation quantization for AI models
Applies sequence transformations to exploit data correlation
Maintains accuracy with mixed-precision token representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies linear transformations along sequence dimension
Keeps small token subset at higher precision
Reduces average activation bit-width while maintaining accuracy
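The mixed-precision side of the method, keeping a small token subset at higher precision, can be sketched as below. This is a hedged sketch: the norm-based selection of "critical" tokens, the bit-widths, and the quantizer are my assumptions for illustration, not the paper's actual selection criterion.

```python
import numpy as np

def quantize(x, bits):
    # Symmetric uniform quantizer with a single shared scale (illustrative).
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def mixed_precision_tokens(acts, low_bits=4, high_bits=8, n_high=2):
    """Quantize all tokens at low_bits, but keep the n_high largest-norm
    tokens at high_bits. Returns the quantized activations, the indices of
    the preserved tokens, and the resulting average bit-width."""
    norms = np.linalg.norm(acts, axis=1)
    keep = np.argsort(norms)[-n_high:]           # "critical" token indices
    out = quantize(acts, low_bits)
    out[keep] = quantize(acts[keep], high_bits)  # re-quantize critical tokens
    avg_bits = (low_bits * (len(acts) - n_high)
                + high_bits * n_high) / len(acts)
    return out, keep, avg_bits

rng = np.random.default_rng(0)
acts = rng.normal(size=(16, 8))         # toy: 16 tokens, hidden dim 8
q, keep, avg_bits = mixed_precision_tokens(acts)
```

With 2 of 16 tokens at 8 bits and the rest at 4 bits, the average bit-width is 4.5, i.e. only marginally above a uniform 4-bit budget while the critical tokens retain a much finer quantization grid.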
Authors

Marco Federici (Qualcomm AI Research)
Riccardo Del Chiaro (Qualcomm AI Research)
Boris van Breugel (Senior ML Researcher, Qualcomm AI Research)
Topics: quantization, generative models, unsupervised learning
Paul Whatmough (Qualcomm AI Research)
Markus Nagel (Qualcomm AI Research)
Topics: Machine learning, Deep Learning