Efficient LLMs with AMP: Attention Heads and MLP Pruning

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead and slow inference of large language models (LLMs) in resource-constrained deployment scenarios, this paper proposes AMP, a structured pruning method. AMP introduces a novel importance-scoring mechanism based on projecting inputs onto weights and achieves, for the first time, joint structured pruning of both multi-head attention (MHA) and MLP modules. It supports cross-architecture adaptation (e.g., LLaMA and Phi) while keeping zero-shot performance degradation below 0.3% at a 30% structured pruning ratio. On commonsense reasoning benchmarks, AMP surpasses prior state-of-the-art methods by up to 1.49 percentage points and delivers significant inference speedup. By jointly optimizing compression efficiency, architectural flexibility, and generalization across model families, AMP establishes a new paradigm for efficient LLM deployment.

📝 Abstract
Deep learning drives a new wave in computing systems and triggers the automation of increasingly complex problems. In particular, Large Language Models (LLMs) have significantly advanced cognitive tasks, often matching or even surpassing human-level performance. However, their extensive parameters result in high computational costs and slow inference, posing challenges for deployment in resource-limited settings. Among the strategies to overcome the aforementioned challenges, pruning emerges as a successful mechanism since it reduces model size while maintaining predictive ability. In this paper, we introduce AMP: Attention Heads and MLP Pruning, a novel structured pruning method that efficiently compresses LLMs by removing less critical structures within Multi-Head Attention (MHA) and Multilayer Perceptron (MLP). By projecting the input data onto weights, AMP assesses structural importance and overcomes the limitations of existing techniques, which often fall short in flexibility or efficiency. In particular, AMP surpasses the current state-of-the-art on commonsense reasoning tasks by up to 1.49 percentage points, achieving a 30% pruning ratio with minimal impact on zero-shot task performance. Moreover, AMP also improves inference speeds, making it well-suited for deployment in resource-constrained environments. We confirm the flexibility of AMP on different families of LLMs, including LLaMA and Phi.
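The abstract's core idea, scoring structures by projecting input data onto their weights and pruning the least important ones, can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the function names, the norm-based score, and the calibration-set shapes are assumptions.

```python
import numpy as np

def head_importance(X, W_heads):
    """Score each attention head by the average magnitude of the
    calibration inputs' projection onto its weights; a larger
    projection is taken as a proxy for greater importance.

    X       : (n_samples, d_model) calibration inputs
    W_heads : (n_heads, d_model, d_head) per-head projection weights
    """
    scores = []
    for W in W_heads:
        proj = X @ W                        # (n_samples, d_head)
        scores.append(np.linalg.norm(proj) / len(X))
    return np.array(scores)

def prune_heads(scores, ratio=0.3):
    """Return indices of heads to keep after removing the
    lowest-scoring fraction (e.g. 30%, as in the paper)."""
    n_prune = int(len(scores) * ratio)
    order = np.argsort(scores)              # ascending: least important first
    return np.sort(order[n_prune:])
```

The same scoring scheme would apply to MLP neurons by treating each intermediate row of the MLP weight matrix as the unit of pruning; the actual criterion used by AMP may differ in the norm and normalization chosen.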
Problem

Research questions and friction points this paper is trying to address.

High computational cost and slow inference of LLMs
Compressing LLMs without sacrificing predictive ability
Enabling deployment in resource-limited settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

AMP jointly prunes attention heads and MLP structures
Projects input data onto weights to assess structural importance
Improves inference speed while maintaining zero-shot performance
Leandro Giusti Mugnaini
Escola Politécnica, Universidade de São Paulo, São Paulo, Brazil
B. Yamamoto
Escola Politécnica, Universidade de São Paulo, São Paulo, Brazil
Lucas Lauton de Alcantara
Escola Politécnica, Universidade de São Paulo, São Paulo, Brazil
Victor Zacarias
Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, Brazil
Edson Bollis
Instituto de Ciência e Tecnologia Itaú (ICTi), São Paulo, Brazil
Lucas F. A. O. Pellicer
Instituto de Ciência e Tecnologia Itaú (ICTi), São Paulo, Brazil
Anna Helena Reali Costa
Full Professor of Computer Engineering, Universidade de São Paulo
Artificial Intelligence, Machine Learning, Reinforcement Learning, Intelligent Robotics
Artur Jordao
Universidade de São Paulo (USP)
Machine Learning, Partial Least Squares, Pattern Recognition