WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity

📅 2026-02-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing training-free activation sparsification methods overlook the interaction between weights and activations, as well as differences in sparsity sensitivity across modules, which limits the inference efficiency they can deliver for large language models. This work proposes a training-free, mixed-granularity activation sparsification approach that introduces, for the first time, a weight-aware mechanism: critical channels are identified by adaptively combining precomputed weight norms with activation magnitudes. Evolutionary search then allocates a global sparsity budget across blocks, and the allocation is refined within each block to minimize reconstruction error and preserve accuracy. Instead of applying a uniform sparsity ratio, the method lets Llama-3.1 retain 97% of its dense performance at 50% sparsity, outperforming the strongest baseline by 2.23 percentage points, and achieves a 21.4% end-to-end inference speedup.
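
The weight-aware idea in the summary can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the scoring rule `|x_j| * norm_j ** alpha`, the exponent `alpha`, and all function names are assumptions made for illustration. The toy example shows how a channel with a modest activation but a large weight norm is dropped by magnitude-only selection and kept by a weight-aware score.

```python
import math

def column_norms(W):
    """L2 norm of each input channel (column) of W, given as a list of rows.
    The norms depend only on the weights, so they can be precomputed offline."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]

def channel_scores(x, norms, alpha):
    """Hypothetical saliency score: |x_j| * norm_j ** alpha.
    With alpha = 0 this reduces to activation magnitude only."""
    return [abs(v) * n ** alpha for v, n in zip(x, norms)]

def sparsify(x, scores, sparsity):
    """Zero out the lowest-scoring fraction `sparsity` of the channels in x."""
    n_keep = len(x) - int(round(sparsity * len(x)))
    ranked = sorted(range(len(x)), key=lambda j: scores[j], reverse=True)
    keep = set(ranked[:n_keep])
    return [v if j in keep else 0.0 for j, v in enumerate(x)]

def matvec(W, x):
    """Dense matrix-vector product y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def l2_error(a, b):
    """Euclidean distance between two output vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Toy layer: channel 1 has a small activation but very large weights.
W = [[0.1, 5.0, 0.1, 0.1],
     [0.1, 5.0, 0.1, 0.1]]
x = [1.0, 0.5, 0.9, 0.8]

norms = column_norms(W)
dense = matvec(W, x)

# 50% sparsity with magnitude-only scores versus weight-aware scores.
x_mag = sparsify(x, channel_scores(x, norms, alpha=0.0), sparsity=0.5)
x_wa = sparsify(x, channel_scores(x, norms, alpha=1.0), sparsity=0.5)

err_mag = l2_error(dense, matvec(W, x_mag))
err_wa = l2_error(dense, matvec(W, x_wa))
print("magnitude-only:", x_mag, err_mag)
print("weight-aware:  ", x_wa, err_wa)
```

In this toy case the weight-aware variant keeps channel 1 and yields the smaller output error. A real system would apply the same selection per token inside a sparse kernel rather than in Python lists.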

📝 Abstract

Large Language Models (LLMs) offer strong capabilities but incur high inference costs due to dense computation and memory access. Training-free activation sparsity is a promising approach for efficient LLM inference, yet existing methods often rely solely on activation information and uniform sparsity ratios. This overlooks the critical interplay between activations and weights, as well as the variation in sensitivity across blocks, and leads to suboptimal performance. We identify two key phenomena in modern LLMs: 1) less significant activations may align with highly important weights, and 2) sparsity sensitivity varies non-monotonically across model blocks. We propose Weight-aware Mixed-Granularity Training-free Activation Sparsity (WiSparse), which leverages both activation and weight information for adaptive sparsity allocation. Specifically, we introduce a weight-aware mechanism that integrates activation magnitudes with precomputed weight norms to accurately identify salient channels. This is combined with a mixed-granularity allocation scheme: a global budget is distributed across blocks via evolutionary search to protect sensitive regions, then refined within blocks to minimize reconstruction error. We improve sparse kernels and demonstrate effectiveness on three representative models. Notably, at 50% sparsity, WiSparse preserves 97% of Llama3.1's dense performance, surpassing the strongest baseline by 2.23 percentage points while achieving a 21.4% acceleration in end-to-end inference speed. Our results push the limits of what training-free approaches can achieve for efficient LLM inference.
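
The block-level allocation step described in the abstract can be sketched as a small elitist evolutionary search. This is a minimal illustration under stated assumptions, not the paper's algorithm: the mutation rule, the population sizes, the bounds `lo` and `hi`, and the quadratic `proxy_loss` with made-up sensitivities are all hypothetical. In practice the loss would be measured on calibration data.

```python
import random

def mutate(alloc, step, rng, lo=0.0, hi=0.9):
    """Move `step` sparsity from one block to another.
    The sum (and so the global budget) is unchanged; out-of-range moves are skipped."""
    child = list(alloc)
    i, j = rng.sample(range(len(child)), 2)
    if child[i] + step <= hi and child[j] - step >= lo:
        child[i] += step
        child[j] -= step
    return child

def evolve_allocation(loss_fn, n_blocks, target, generations=200,
                      pop_size=8, n_children=16, step=0.05, seed=0):
    """Elitist search over per-block sparsity ratios whose mean equals `target`."""
    rng = random.Random(seed)
    population = [[target] * n_blocks]  # start from the uniform allocation
    for _ in range(generations):
        children = [mutate(rng.choice(population), step, rng)
                    for _ in range(n_children)]
        # Keep the best candidates; parents survive, so the best never gets worse.
        population = sorted(population + children, key=loss_fn)[:pop_size]
    return population[0]

# Hypothetical per-block sensitivities that vary non-monotonically with depth.
sens = [4.0, 1.0, 3.0, 0.5]

def proxy_loss(alloc):
    """Toy stand-in for a calibration loss: sensitive blocks are penalized more."""
    return sum(s * a ** 2 for s, a in zip(sens, alloc))

uniform = [0.5] * len(sens)
best = evolve_allocation(proxy_loss, n_blocks=len(sens), target=0.5)
print("uniform loss:", proxy_loss(uniform))
print("searched allocation:", [round(a, 2) for a in best], proxy_loss(best))
```

Because each mutation only moves sparsity between two blocks, every candidate satisfies the global budget by construction, so the search needs no penalty term or repair step.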

Problem

Research questions and friction points this paper is trying to address.

activation sparsity

large language models

inference efficiency

weight-activation interaction

sparsity sensitivity

Innovation

Methods, ideas, or system contributions that make the work stand out.

activation sparsity

weight-aware

training-free