GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on edge devices requires dynamic pruning, but existing static or predictive methods generalize poorly, while zero-shot approaches fail in short-prompt or long-generation scenarios. Method: We propose a training-free dynamic pruning method with no inference overhead that jointly leverages global model statistics (neuron activation magnitude and influence) and prompt-specific local features. A rank-aggregation algorithm dynamically ranks feed-forward network (FFN) neurons to enable fine-grained, context-aware selection of critical units. Contribution/Results: The approach requires no auxiliary predictors and adds zero runtime latency. Evaluated across multiple LLMs and benchmarks, it consistently outperforms state-of-the-art training-free pruning methods. Notably, it maintains high output quality in long-text generation while significantly improving inference efficiency, achieving a superior trade-off between accuracy and computational cost for edge deployment.
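The global-local rank aggregation described above can be sketched as follows. This is a minimal illustration, assuming a simple Borda-style rank sum over three per-neuron score vectors; the function names, the equal weighting, and the toy statistics are our assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of global-local rank aggregation for FFN neuron
# selection. Global statistics (activation magnitude, impact) come from
# offline model profiling; local statistics come from the current prompt.
import numpy as np

def rank(scores: np.ndarray) -> np.ndarray:
    """Rank of each neuron under one statistic (0 = least important)."""
    order = np.argsort(scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    return ranks

def select_ffn_neurons(global_act, global_impact, local_act, keep_ratio=0.5):
    """Keep the top fraction of FFN neurons by aggregated rank.

    Aggregation here is an unweighted rank sum (a Borda count), an
    assumption for illustration; the paper's aggregation may differ.
    """
    agg = rank(global_act) + rank(global_impact) + rank(local_act)
    k = int(len(agg) * keep_ratio)
    return np.argsort(agg)[-k:]  # indices of retained neurons

# Toy example: 8 neurons, keep half.
g_act = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6])
g_imp = np.array([0.2, 0.8, 0.1, 0.9, 0.3, 0.7, 0.5, 0.6])
l_act = np.array([0.0, 1.0, 0.2, 0.8, 0.1, 0.9, 0.3, 0.7])
kept = select_ffn_neurons(g_act, g_imp, l_act, keep_ratio=0.5)
print(sorted(kept.tolist()))  # → [1, 3, 5, 7]
```

Because both global score vectors can be computed once offline and the local ranks are a byproduct of the prompt's forward pass, selection reduces to an argsort per layer, which is consistent with the paper's claim of zero added inference latency.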

📝 Abstract
Deploying Large Language Models (LLMs) on edge hardware demands aggressive, prompt-aware dynamic pruning to reduce computation without degrading quality. Static or predictor-based schemes either lock in a single sparsity pattern or incur extra runtime overhead, and recent zero-shot methods that rely on statistics from a single prompt fail on short prompt and/or long generation scenarios. We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance Aggregation for feed-forward network SparSification, two training-free methods that dynamically select FFN units using a rank-aggregation of prompt local and model-intrinsic global neuron statistics. Empirical results across multiple LLMs and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios, without relying on auxiliary predictors or adding any inference overhead.
Problem

Research questions and friction points this paper is trying to address.

Dynamic pruning for LLM edge deployment without quality loss
Overcoming limitations of static and predictor-based sparsity methods
Improving performance on short prompt and long generation scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global-local neural importance aggregation
Training-free dynamic FFN pruning
No auxiliary predictors or overhead
Authors

Amirmohsen Sattarifard, Huawei Technologies Canada
Sepehr Lavasani, Huawei Technologies Canada
Ehsan Imani, Huawei Technologies Canada
Kunlin Zhang, Huawei Technologies Canada
Hanlin Xu, Huawei
Fengyu Sun, Huawei
Negar Hassanpour, Senior Researcher (Machine Learning), Huawei Technologies Canada
Chao Gao, Huawei Technologies Canada