🤖 AI Summary
Dynamic pruning is key to deploying large language models (LLMs) on edge devices, yet existing methods fall short: static or predictor-based schemes generalize poorly, while zero-shot approaches fail in short-prompt or long-generation scenarios.
Method: We propose a training-free, inference-overhead-free dynamic pruning method that jointly leverages global model statistics—neuron activation magnitude and influence—and prompt-specific local features. A rank-aggregation algorithm dynamically ranks feed-forward network neurons to enable fine-grained, context-aware selection of critical units.
Contribution/Results: Our approach requires no auxiliary predictors and adds no runtime latency. Evaluated across multiple LLMs and benchmarks, it consistently outperforms state-of-the-art training-free pruning methods. Notably, it maintains high output quality in long-text generation while significantly improving inference efficiency, achieving a superior accuracy-computation trade-off for edge deployment.
📝 Abstract
Deploying Large Language Models (LLMs) on edge hardware demands aggressive, prompt-aware dynamic pruning to reduce computation without degrading quality. Static or predictor-based schemes either lock in a single sparsity pattern or incur extra runtime overhead, and recent zero-shot methods that rely on statistics from a single prompt fail in short-prompt and/or long-generation scenarios. We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance Aggregation for feed-forward network SparSification, two training-free methods that dynamically select FFN units via rank aggregation of prompt-local and model-intrinsic global neuron statistics. Empirical results across multiple LLMs and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios, without relying on auxiliary predictors or adding any inference overhead.
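To make the core idea concrete, here is a minimal sketch of rank-aggregation-based neuron selection. This is not the paper's exact algorithm: the Borda-style rank sum, the function name `glass_select`, and the specific statistics passed in are illustrative assumptions; the point is only to show how per-prompt local activations and precomputed global statistics can be fused into a single ranking with no learned predictor.

```python
import numpy as np

def glass_select(local_act, global_act, global_impact, keep_ratio=0.5):
    """Hypothetical sketch of global-local rank aggregation for FFN
    neuron selection. Each input is a 1-D array with one score per
    neuron: prompt-local activation magnitude, global activation
    magnitude, and global influence/impact (the latter two would be
    precomputed offline from model statistics)."""
    n = local_act.shape[0]

    def ranks(x):
        # Map each value to its rank (0 = smallest, n-1 = largest).
        return np.argsort(np.argsort(x))

    # Borda-style aggregation: sum the per-statistic ranks, so a neuron
    # must score well both locally and globally to survive pruning.
    agg = ranks(local_act) + ranks(global_act) + ranks(global_impact)

    # Keep the top-k neurons under the target sparsity budget.
    k = max(1, int(keep_ratio * n))
    keep = np.argsort(agg)[-k:]
    return np.sort(keep)

# Toy usage: select 2 of 8 neurons from random scores.
rng = np.random.default_rng(0)
kept = glass_select(rng.random(8), rng.random(8), rng.random(8),
                    keep_ratio=0.25)
print(kept)
```

Because the global statistics are computed once offline and the local ranking reuses activations already produced during the forward pass, a scheme like this adds no auxiliary predictor and essentially no runtime overhead, which matches the training-free, overhead-free framing above.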