🤖 AI Summary
To address underutilized activation sparsity in large language model (LLM) inference, this paper proposes a training-free, ultra-lightweight activation sparsity prediction method. The approach first replaces SiLU with ReLU to enhance intrinsic activation sparsity, then rapidly estimates per-layer sparsity via bitwise sign comparisons between inputs and weights, requiring no floating-point operations. A tunable conservativeness parameter enables adaptive thresholding, balancing accuracy and speed. Crucially, the method avoids online statistics and auxiliary parameters entirely, and is fully compatible with standard sparse tensor computation backends. Experiments on mainstream LLMs demonstrate faster inference than state-of-the-art methods, with accuracy degradation within 1 percentage point. This work demonstrates training-free, bit-level, and controllably conservative activation sparsity prediction for efficient LLM inference.
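The core idea of the sign-bit comparison can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: for a ReLU layer, an output element is zero when the corresponding weight-row/input dot product is non-positive, and the sketch guesses that sign by counting sign agreements between weights and inputs. The function name `predict_active_rows` and the `margin` parameter (standing in for the paper's conservativeness knob) are assumptions for illustration.

```python
import numpy as np

def predict_active_rows(W, x, margin=0):
    """Sketch of sign-bit activation sparsity prediction (illustrative only).

    For ReLU(W @ x), row i yields zero when w_i . x <= 0. Instead of
    computing dot products, compare only sign bits: positions where the
    signs of weight and input agree contribute a positive product term,
    disagreements a negative one. Predict a row as active when agreements
    outnumber disagreements by more than `margin`. Lowering `margin`
    makes the predictor more conservative (predicts more rows active,
    trading speed for accuracy).
    """
    x_neg = np.signbit(x)                              # sign bit of each input element
    w_neg = np.signbit(W)                              # sign bits of the weight matrix
    disagree = np.logical_xor(w_neg, x_neg[None, :])   # sign mismatch -> negative product term
    votes = W.shape[1] - 2 * disagree.sum(axis=1)      # agreements minus disagreements per row
    return votes > margin                              # True -> row predicted nonzero

# Toy example: row 0 has all signs agreeing with x, row 1 all disagreeing.
W = np.array([[1.0, 1.0], [-1.0, -1.0]])
x = np.array([1.0, 1.0])
print(predict_active_rows(W, x))             # row 0 predicted active, row 1 inactive
print(predict_active_rows(W, x, margin=-3))  # very conservative: all rows predicted active
```

Note that the sign test ignores magnitudes, so it can mispredict when a few large-magnitude terms dominate the dot product; the conservativeness margin exists precisely to absorb such errors.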
📝 Abstract
Leveraging sparsity is crucial for optimizing large language model inference. However, modern LLMs employing SiLU as their activation function exhibit minimal activation sparsity. Recent research has proposed replacing SiLU with ReLU to induce significant activation sparsity, showing no downstream-task accuracy degradation after fine-tuning. However, taking full advantage of this sparsity has required training a predictor to estimate it. In this paper, we introduce SparseInfer, a simple, lightweight, and training-free predictor of activation sparsity for ReLU-fied LLMs, in which activation sparsity is predicted by comparing only the sign bits of inputs and weights. To compensate for possible prediction inaccuracy, the predictor's conservativeness can be tuned adaptively, which can also serve as a control knob for optimizing LLM inference. The proposed method achieves faster inference than the state of the art, with negligible accuracy loss of within 1 percentage point.