🤖 AI Summary
Deploying large language models (LLMs) on resource-constrained edge devices necessitates lightweighting, particularly for the computationally intensive feed-forward network (FFN) modules.
Method: This work systematically identifies, for the first time, the *inducible sparsity* in FFN activations—orthogonal to activation-function-specific designs (e.g., ReLU)—and proposes a general activation sparsification paradigm. It introduces a zero-forcing threshold tuning mechanism, coupled with predictive activation pattern modeling, weight prefetching, and lazy loading, to jointly reduce memory footprint and computational cost by ~50%. The approach is orthogonal to existing weight-compression techniques and mitigates cache pollution.
Contribution/Results: Evaluated on mainstream open-source LLMs, the method achieves near-lossless perplexity after FFN compression while significantly accelerating inference. It provides a novel, practical pathway for efficient LLM deployment on edge devices.
📝 Abstract
Deploying local AI models, such as Large Language Models (LLMs), to edge devices can substantially enhance devices’ independent capabilities, alleviate the server’s burden, and lower the response time. Owing to these tremendous potentials, many big tech companies have been actively promoting edge LLM evolution and released several lightweight Small Language Models (SLMs) to bridge this gap. However, SLMs currently only work well on limited real-world applications. We still have huge motivations to deploy more powerful (larger-scale) AI models on edge devices and enhance their smartness level. Unlike the conventional approaches for AI model compression, we investigate from activation sparsity. The activation sparsity method is orthogonal and combinable with existing techniques to maximize compression rate while maintaining great accuracy. According to statistics of open-source LLMs, their Feed-Forward Network (FFN) components typically comprise a large proportion of parameters (around $ frac{2}{3}$). This internal feature ensures that our FFN optimizations would have a better chance of achieving effective compression. Moreover, our findings are beneficial to general LLMs and are not restricted to ReLU-based models.This work systematically investigates the tradeoff between enforcing activation sparsity and perplexity (accuracy) on state-of-the-art LLMs. Our empirical analysis demonstrates that we can obtain around 50% of main memory and computing reductions for critical FFN components with negligible accuracy degradation. This extra 50% sparsity does not naturally exist in the current LLMs, which require tuning LLMs’ activation outputs by injecting zero-enforcing thresholds. To obtain the benefits of activation sparsity, we provide a guideline for the system architect for LLM prediction and prefetching. Moreover, we further verified the predictability of activation patterns in recent LLMs. The success prediction allows the system to prefetch the necessary weights while omitting the inactive ones and their successors (compress models from the memory’s perspective), therefore lowering cache/memory pollution and reducing LLM execution time on resource-constraint edge devices.