🤖 AI Summary
To address underutilized activation sparsity in large language model (LLM) inference, this paper proposes a training-free, ultra-lightweight activation sparsity prediction method. The approach first replaces SiLU with ReLU to enhance intrinsic activation sparsity, then rapidly estimates per-layer sparsity via bitwise sign comparisons between inputs and weights, requiring no floating-point operations. A tunable conservativeness parameter enables adaptive thresholding, balancing accuracy and speed. Crucially, the method avoids online statistics and auxiliary parameters entirely, and is fully compatible with standard sparse tensor computation backends. Experiments on mainstream LLMs demonstrate faster inference than state-of-the-art methods, with accuracy degradation within 1 percentage point. This work demonstrates training-free, bit-level, and controllably conservative activation sparsity prediction for efficient LLM inference.
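The core idea of the sign-bit comparison can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: for a ReLU layer, an output element is zero when the corresponding weight-row/input dot product is non-positive, and the sketch guesses that sign by counting sign agreements between weights and inputs. The function name `predict_active_rows` and the `margin` parameter (standing in for the paper's conservativeness knob) are assumptions for illustration.

```python
import numpy as np

def predict_active_rows(W, x, margin=0):
    """Sketch of sign-bit activation sparsity prediction (illustrative only).

    For ReLU(W @ x), row i yields zero when w_i . x <= 0. Instead of
    computing dot products, compare only sign bits: positions where the
    signs of weight and input agree contribute a positive product term,
    disagreements a negative one. Predict a row as active when agreements
    outnumber disagreements by more than `margin`. Lowering `margin`
    makes the predictor more conservative (predicts more rows active,
    trading speed for accuracy).
    """
    x_neg = np.signbit(x)                              # sign bit of each input element
    w_neg = np.signbit(W)                              # sign bits of the weight matrix
    disagree = np.logical_xor(w_neg, x_neg[None, :])   # sign mismatch -> negative product term
    votes = W.shape[1] - 2 * disagree.sum(axis=1)      # agreements minus disagreements per row
    return votes > margin                              # True -> row predicted nonzero

# Toy example: row 0 has all signs agreeing with x, row 1 all disagreeing.
W = np.array([[1.0, 1.0], [-1.0, -1.0]])
x = np.array([1.0, 1.0])
print(predict_active_rows(W, x))             # row 0 predicted active, row 1 inactive
print(predict_active_rows(W, x, margin=-3))  # very conservative: all rows predicted active
```

Note that the sign test ignores magnitudes, so it can mispredict when a few large-magnitude terms dominate the dot product; the conservativeness margin exists precisely to absorb such errors.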
📝 Abstract
Leveraging sparsity is crucial for optimizing large language model inference. However, modern LLMs employing SiLU as their activation function exhibit minimal activation sparsity. Recent research has proposed replacing SiLU with ReLU to induce significant activation sparsity, showing no downstream-task accuracy degradation after fine-tuning. However, taking full advantage of this sparsity has required training a predictor to estimate it. In this paper, we introduce SparseInfer, a simple, lightweight, and training-free predictor of activation sparsity for ReLU-fied LLMs, in which activation sparsity is predicted by comparing only the sign bits of inputs and weights. To compensate for possible prediction inaccuracy, the predictor's conservativeness can be tuned adaptively, which can also serve as a control knob for optimizing LLM inference. The proposed method achieves faster inference than the state of the art, with negligible accuracy loss of within 1 percentage point.