🤖 AI Summary
Generative AI models suffer severe accuracy degradation under ≤4-bit activation quantization, yet low-bit quantization is needed to reduce their high inference latency and power consumption. To address this, we propose a sequence-aware mixed-precision quantization framework. Our core innovation is the first application of invertible linear transformations along the token-sequence dimension, explicitly modeling the strong local correlations inherent in language and vision data; this is combined with a high-precision preservation mechanism for critical tokens, reducing the average bit-width without compromising information integrity. The framework is orthogonal to existing weight-quantization and feature-transformation techniques. Extensive experiments on mainstream large language models (LLMs) and large vision models (LVMs) demonstrate that our method significantly improves accuracy under 4-bit and sub-4-bit activation quantization (+12.3% average gain) while reducing memory footprint by 38% and inference latency by 27%, establishing a new paradigm for efficient edge deployment of generative AI.
📝 Abstract
Quantization is the key method for reducing the inference latency, power consumption, and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g., rotations) can aid quantization by reparameterizing feature channels and weights. In this paper, we propose *Sequence Transformation and Mixed Precision* (STaMP) quantization, a novel strategy that applies linear transformations along the *sequence* dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, STaMP maintains model accuracy at lower average activation bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low-bit-width activation quantization and complements established activation and weight quantization methods, including recent feature transformations.
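The idea in the abstract can be illustrated with a minimal NumPy sketch: apply an invertible transform along the sequence axis of an activation tensor, fake-quantize most tokens at a low bit-width while keeping a few selected tokens at higher precision, then invert the transform. Note that `stamp_like_quantize`, the choice of transform, and the token-selection indices here are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric fake-quantization to the given bit-width."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    if scale == 0.0:
        return x.copy()
    return np.round(x / scale) * scale

def stamp_like_quantize(acts, transform, hi_idx, lo_bits=4, hi_bits=8):
    """Sketch of sequence-dimension transformation + mixed precision.

    acts:      (seq_len, hidden) activation matrix
    transform: invertible (seq_len, seq_len) matrix mixing along the
               sequence dimension (the paper's transforms are not
               specified here; this is a stand-in)
    hi_idx:    indices of 'critical' tokens kept at higher precision
    """
    z = transform @ acts                        # mix along the sequence axis
    q = quantize(z, lo_bits)                    # low-precision bulk of tokens
    q[hi_idx] = quantize(z[hi_idx], hi_bits)    # preserve selected tokens
    return np.linalg.inv(transform) @ q         # undo the invertible transform

# Toy usage: 8 tokens, 16 hidden dims, first two tokens kept at 8 bits.
rng = np.random.default_rng(0)
T = rng.standard_normal((8, 8))                 # random invertible transform
x = rng.standard_normal((8, 16))
xq = stamp_like_quantize(x, T, hi_idx=[0, 1])
```

The average bit-width here is (6·4 + 2·8)/8 = 5 bits, showing how a small high-precision token budget raises the effective precision only slightly while protecting the most information-dense positions.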