Activation Compression in LLMs: Theoretical Analysis and Efficient Algorithm

📅 2026-05-02
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the substantial memory overhead of activation storage in large language model training, a challenge inadequately tackled by existing compression methods lacking theoretical foundations. We establish the first theoretical framework for activation compression, proving that unbiased compression in linear operators ensures convergence while revealing the risks associated with compressing nonlinear operators. Building on this insight, we propose an activation–gradient co-compression method that reuses low-rank factors from activation compression to compress gradients without additional computation or gradient error. Under assumptions of unbiased low-rank compression, bounded gradient variance, and L-smoothness, we provide rigorous convergence guarantees. Experiments on Qwen and LLaMA demonstrate that our approach significantly reduces memory consumption during both pretraining and multi-task fine-tuning while preserving high model accuracy.
📝 Abstract
Training large language models (LLMs) is highly memory-intensive, as it must store not only weights and optimizer states but also intermediate activations for backpropagation. While existing memory-efficient methods largely focus on gradients and optimizer states, activation compression is less well established due to the lack of LLM-tailored theory and guarantees. In this work, we develop a theoretical framework showing that activation compression is safe for linear operators when the compression is unbiased, but problematic for nonlinear ones. We further derive a gradient variance bound and establish convergence guarantees for applying activation compression to all linear operators under the standard $L$-smoothness assumption, showing that it does not change the convergence rate. Guided by the theory, we propose an activation-gradient co-compression method that reuses low-rank activation factors to compress linear-layer gradients without extra computation or additional gradient error. We conduct extensive experiments on Qwen and LLaMA models using a pretraining benchmark and multiple fine-tuning benchmarks to validate our theory and demonstrate the competitive performance of our method in both accuracy and compression efficiency. We provide our code in the supplementary material for reproducibility.
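The claim about linear operators can be illustrated with a minimal NumPy sketch. For a linear layer $Y = XW$, the weight gradient is $X^\top \, \partial Y$, which is linear in the stored activation $X$, so an unbiased estimate of $X$ yields an unbiased estimate of the gradient. The sketch assumes a Gaussian random projection as the unbiased low-rank compressor; this is an illustrative choice, not necessarily the compressor used in the paper, and the function names are hypothetical.

```python
import numpy as np


def compress_activation(x, rank, rng):
    """Store x @ p instead of x, with p a random (d_in, rank) projection.

    Assumption: entries of p are N(0, 1/rank), so E[p @ p.T] = I and
    (x @ p) @ p.T is an unbiased estimate of x.
    """
    d_in = x.shape[1]
    p = rng.normal(0.0, 1.0 / np.sqrt(rank), size=(d_in, rank))
    return x @ p, p  # (n, rank) activation factor and (d_in, rank) projection


def weight_grad_from_compressed(x_low, p, grad_out):
    """Weight gradient of Y = X W computed from the compressed activation.

    The gradient estimate is p @ (x_low.T @ grad_out), so the small
    (rank, d_out) matrix is already a low-rank factor of the gradient.
    """
    g_low = x_low.T @ grad_out  # (rank, d_out) compressed gradient
    return g_low, p @ g_low     # dense (d_in, d_out) gradient estimate
```

In this sketch the projection `p` drawn for the activation also serves as the left factor of the weight gradient, so the `(rank, d_out)` matrix `g_low` comes out of the backward pass without a separate gradient-compression step. A nonlinear operator offers no such guarantee, since the expectation of a nonlinear function of the compressed input generally differs from the function of the exact input.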
Problem

Research questions and friction points this paper is trying to address.

activation compression
large language models
memory efficiency
theoretical analysis
intermediate activations
Innovation

Methods, ideas, or system contributions that make the work stand out.

activation compression
large language models
gradient variance
convergence guarantee
co-compression