Training-Free Activation Sparsity in Large Language Models

πŸ“… 2024-08-26
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 2
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing large language models (LLMs) struggle to achieve full-model activation sparsification without costly continual pretraining, limiting inference efficiency gains. This paper introduces TEAL, a training-free, magnitude-based activation sparsification method applied to hidden states throughout the entire model. It requires no fine-tuning or additional training and supports mainstream architectures (e.g., Llama-2/3, Mistral), remaining compatible with weight quantization. TEAL pairs magnitude-based pruning with improved sparse matrix multiplication kernels. Evaluated on models ranging from 7B to 70B parameters, it achieves 40%–50% model-level activation sparsity with minimal performance degradation while accelerating decoding by up to 1.53×–1.8×.

πŸ“ Abstract
Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory-movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across Llama-2, Llama-3, and Mistral families, with sizes varying from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53× and 1.8× at 40% and 50% model-wide sparsity. TEAL is compatible with weight quantization, enabling further efficiency gains.
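To make the core idea concrete, here is a minimal NumPy sketch of magnitude-based activation sparsification as the abstract describes it: entries of a hidden-state tensor with the smallest magnitudes are zeroed so that a target fraction of the activations is sparse. (The function name, the per-call quantile threshold, and the 50% target are illustrative assumptions; the paper calibrates thresholds per layer.)

```python
import numpy as np

def sparsify_activations(x, sparsity=0.5):
    """Zero out the smallest-magnitude entries of an activation tensor.

    Illustrative sketch: the threshold is recomputed here from the
    magnitude distribution of `x` itself so that roughly `sparsity`
    fraction of entries fall below it and get zeroed.
    """
    threshold = np.quantile(np.abs(x), sparsity)
    return np.where(np.abs(x) >= threshold, x, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
sx = sparsify_activations(x, sparsity=0.5)
# roughly half the entries are now exactly zero
print(float(np.mean(sx == 0)))
```

Because the surviving entries are untouched, the dense matmuls downstream see the same values on the active coordinates; the speedup comes from a kernel that skips the zeroed ones.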
Problem

Research questions and friction points this paper is trying to address.

Activation sparsity in LLMs
Training-free sparsity method
Improving inference speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free activation sparsity method
Magnitude-based sparsity applied globally
Improved sparse kernels for speedups
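The kernel-side contribution can be sketched in a few lines: for a matrix-vector product y = W @ x, zero entries of x make the matching columns of W irrelevant, so a sparse kernel can avoid reading them from memory, which is where memory-bound decoding spends its time. This NumPy version (a hypothetical illustration, not the paper's GPU kernel) only mimics the arithmetic of that column-skipping idea.

```python
import numpy as np

def sparse_matvec(W, x):
    """Matrix-vector product that skips columns of W where x is zero.

    Sketch of why activation sparsity speeds decoding: only the columns
    of W aligned with nonzero activations are gathered and multiplied.
    """
    nz = np.flatnonzero(x)      # indices of active (nonzero) inputs
    return W[:, nz] @ x[nz]     # touch only the needed columns of W

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))
x = np.array([0.0, 1.5, 0.0, -2.0, 0.0, 0.3])
# result matches the dense product exactly
assert np.allclose(sparse_matvec(W, x), W @ x)
```

At 50% activation sparsity this halves the weight traffic per matmul, which is consistent with the reported wall-clock decoding speedups.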
James Liu
Massachusetts Institute of Technology, Together AI
Pragaash Ponnusamy
Together AI
Tianle Cai
PhD Student, Princeton University
Machine Learning
Han Guo
Massachusetts Institute of Technology
Yoon Kim
Associate Professor, MIT
Machine Learning, Natural Language Processing, Deep Learning
Ben Athiwaratkun
Together AI
Artificial Intelligence