Compressing Large Language Models with Automated Sub-Network Search

📅 2024-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from high inference costs, and manually designed pruning strategies generalize poorly across architectures and tasks. Method: This paper proposes a multi-objective neural architecture search (NAS)-based compression framework for edge deployment. It formulates structured pruning as a search for the Pareto-optimal set of sub-networks, jointly pruning attention heads, hidden neurons, and network layers at multiple granularities, while enforcing structured sparsity so the resulting models remain hardware-efficient. Contribution/Results: Evaluated on 11 diverse downstream tasks, the method achieves up to 9.85% improvement on average and up to 22% lower on-device latency, outperforming state-of-the-art structured pruning approaches and fine-tuned smaller sub-networks extracted from the pre-trained model. By replacing heuristic, human-crafted pruning policies with automated sub-network search, it addresses the limited generalizability of manual strategy design and enables scalable, task-agnostic LLM compression for resource-constrained environments.
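The core idea of the search, selecting sub-networks that are Pareto-optimal with respect to task performance and on-device latency, can be illustrated with a toy example. The accuracy and latency proxies below are illustrative stand-ins invented for this sketch (the paper's actual method scores sub-networks on downstream tasks and measures real device latency), and the configuration space is reduced to just attention heads and layers:

```python
import itertools

def pareto_front(candidates, objectives):
    """Keep candidates not dominated by any other.

    A candidate is dominated if some other candidate has performance
    at least as high AND latency at least as low (and differs in one).
    """
    front = []
    for c in candidates:
        perf_c, lat_c = objectives(c)
        dominated = any(
            objectives(o)[0] >= perf_c
            and objectives(o)[1] <= lat_c
            and objectives(o) != (perf_c, lat_c)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

# Toy proxies (stand-ins for measured task accuracy and device latency):
# larger sub-networks score better but run slower.
def objectives(cfg):
    heads, layers = cfg
    perf = 1.0 - 0.5 / heads - 0.3 / layers
    latency = 0.1 * heads * layers
    return perf, latency

# Enumerate sub-network configurations (attention heads x transformer layers)
candidates = list(itertools.product([2, 4, 8], [6, 12, 24]))
front = pareto_front(candidates, objectives)
```

With these proxies, a configuration like 2 heads / 12 layers drops out of the front because 4 heads / 6 layers scores higher at the same latency; the surviving set traces the performance-latency trade-off from which a deployment target can pick its operating point.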

📝 Abstract
Large Language Models (LLMs) demonstrate exceptional reasoning abilities, enabling strong generalization across diverse tasks such as commonsense reasoning and instruction following. However, as LLMs scale, inference costs become increasingly prohibitive, accumulating significantly over their life cycle. In this paper we consider model compression for LLMs to reduce model size while improving downstream task performance. We phrase this as a neural architecture search problem that automatically prunes structural components, such as attention heads, neurons, and layers, by searching for the Pareto-optimal set of sub-networks balancing between performance and on-device latency. Compared to state-of-the-art structural pruning approaches and fine-tuned smaller sub-networks extracted from the pre-trained model, our method achieves up to 9.85% improvement on average on 11 diverse downstream tasks, while achieving up to 22% improvement in on-device latency.
Problem

Research questions and friction points this paper is trying to address.

Compress Large Language Models efficiently
Optimize model size and task performance
Balance performance and on-device latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated, Pareto-optimal sub-network search
Prunes attention heads, neurons, and layers
Balances task performance and on-device latency