Activation Sparsity Opportunities for Compressing General Large Language Models

📅 2024-11-22
🏛️ IEEE International Performance, Computing, and Communications Conference
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on resource-constrained edge devices necessitates lightweighting, particularly for the computationally intensive feed-forward network (FFN) modules. Method: This work systematically identifies, for the first time, the *inducible sparsity* in FFN activations—orthogonal to activation-function-specific designs (e.g., ReLU)—and proposes a general activation sparsification paradigm. It introduces a zero-forcing threshold tuning mechanism, coupled with predictive activation pattern modeling, weight prefetching, and lazy loading, to jointly reduce memory footprint and computational cost by ~50%. The approach is orthogonal to existing weight-compression techniques and mitigates cache pollution. Contribution/Results: Evaluated on mainstream open-source LLMs, the method achieves near-lossless perplexity after FFN compression while significantly accelerating inference. It provides a novel, practical pathway for efficient LLM deployment on edge devices.

📝 Abstract
Deploying local AI models, such as Large Language Models (LLMs), to edge devices can substantially enhance devices' independent capabilities, alleviate the server's burden, and lower the response time. Owing to this tremendous potential, many big tech companies have been actively promoting edge LLM evolution and have released several lightweight Small Language Models (SLMs) to bridge this gap. However, SLMs currently work well on only a limited set of real-world applications. We still have strong motivation to deploy more powerful (larger-scale) AI models on edge devices and enhance their smartness level. Unlike conventional approaches to AI model compression, we investigate from the angle of activation sparsity. The activation sparsity method is orthogonal to and combinable with existing techniques, maximizing the compression rate while maintaining high accuracy. According to statistics of open-source LLMs, their Feed-Forward Network (FFN) components typically comprise a large proportion of the parameters (around $\frac{2}{3}$). This internal feature ensures that our FFN optimizations have a better chance of achieving effective compression. Moreover, our findings benefit general LLMs and are not restricted to ReLU-based models. This work systematically investigates the tradeoff between enforcing activation sparsity and perplexity (accuracy) on state-of-the-art LLMs. Our empirical analysis demonstrates that we can obtain around 50% main-memory and computation reductions for the critical FFN components with negligible accuracy degradation. This extra 50% sparsity does not naturally exist in current LLMs; obtaining it requires tuning the LLMs' activation outputs by injecting zero-enforcing thresholds. To realize the benefits of activation sparsity, we provide a guideline for system architects on LLM prediction and prefetching. Moreover, we further verified the predictability of activation patterns in recent LLMs.
Successful prediction allows the system to prefetch the necessary weights while omitting the inactive ones and their successors (compressing models from the memory's perspective), thereby lowering cache/memory pollution and reducing LLM execution time on resource-constrained edge devices.
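The zero-enforcing threshold idea described in the abstract can be sketched in a few lines: clamp small-magnitude FFN activations to zero, then skip the rows of the down-projection matrix that correspond to inactive neurons (the rows the system would not need to fetch into memory). The following minimal NumPy sketch is illustrative only, not the paper's implementation; the names `tau`, `w_up`, and `w_down`, the GELU nonlinearity, and the single-token shapes are all assumptions for the example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, typical of non-ReLU LLM FFNs
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def sparse_ffn(x, w_up, w_down, tau):
    """FFN forward pass with a zero-enforcing threshold tau (illustrative).

    Activations with magnitude below tau are treated as zero, so the
    matching rows of w_down contribute nothing and need not be loaded.
    """
    h = gelu(x @ w_up)           # hidden activations, shape (d_ff,)
    active = np.abs(h) >= tau    # boolean activation pattern
    # Only active rows of w_down participate in the down-projection.
    return h[active] @ w_down[active]

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.normal(size=d_model)
w_up = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))

dense = gelu(x @ w_up) @ w_down          # reference dense output
sparse = sparse_ffn(x, w_up, w_down, tau=0.1)
sparsity = 1.0 - np.mean(np.abs(gelu(x @ w_up)) >= 0.1)
```

With `tau = 0` the sparse path reproduces the dense output exactly; raising `tau` trades a small output perturbation for a larger fraction of skippable `w_down` rows, which is the memory/accuracy tradeoff the paper tunes.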
Problem

Research questions and friction points this paper is trying to address.

Language Model Compression
Accuracy Preservation
Mobile Device Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation Sparsity
Model Size Reduction
Efficient Inference
Nobel Dhar
College of Computing and Software Engineering, Kennesaw State University
Bobin Deng
Assistant Professor of Computer Science, Kennesaw State University
Computer Architecture · Machine Learning · Neuromorphic Computing · AI for Science
Md Romyull Islam
College of Computing and Software Engineering, Kennesaw State University
Kazi Fahim Ahmad Nasif
College of Computing and Software Engineering, Kennesaw State University
Liang Zhao
College of Computing and Software Engineering, Kennesaw State University
Kun Suo
College of Computing and Software Engineering, Kennesaw State University