Compressing Large Language Models with Automated Sub-Network Search

📅 2024-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from high inference costs, and manually designed pruning strategies generalize poorly across architectures and tasks. Method: This paper proposes a multi-objective neural architecture search (NAS)-based compression framework for edge deployment. It formulates structured pruning as a search for the Pareto-optimal set of sub-networks, jointly pruning attention heads, hidden neurons, and network layers at multiple granularities, while enforcing structured sparsity so the resulting models remain hardware-efficient. Contribution/Results: Evaluated on 11 diverse downstream tasks, the method achieves up to 9.85% improvement on average and up to 22% lower on-device latency, outperforming state-of-the-art structured pruning approaches and fine-tuned smaller sub-networks extracted from the pre-trained model. By replacing heuristic, human-crafted pruning policies with automated sub-network search, it addresses the limited generalizability of manual strategy design and enables scalable, task-agnostic LLM compression for resource-constrained environments.
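The core idea of the search, selecting sub-networks that are Pareto-optimal with respect to task performance and on-device latency, can be illustrated with a toy example. The accuracy and latency proxies below are illustrative stand-ins invented for this sketch (the paper's actual method scores sub-networks on downstream tasks and measures real device latency), and the configuration space is reduced to just attention heads and layers:

```python
import itertools

def pareto_front(candidates, objectives):
    """Keep candidates not dominated by any other.

    A candidate is dominated if some other candidate has performance
    at least as high AND latency at least as low (and differs in one).
    """
    front = []
    for c in candidates:
        perf_c, lat_c = objectives(c)
        dominated = any(
            objectives(o)[0] >= perf_c
            and objectives(o)[1] <= lat_c
            and objectives(o) != (perf_c, lat_c)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

# Toy proxies (stand-ins for measured task accuracy and device latency):
# larger sub-networks score better but run slower.
def objectives(cfg):
    heads, layers = cfg
    perf = 1.0 - 0.5 / heads - 0.3 / layers
    latency = 0.1 * heads * layers
    return perf, latency

# Enumerate sub-network configurations (attention heads x transformer layers)
candidates = list(itertools.product([2, 4, 8], [6, 12, 24]))
front = pareto_front(candidates, objectives)
```

With these proxies, a configuration like 2 heads / 12 layers drops out of the front because 4 heads / 6 layers scores higher at the same latency; the surviving set traces the performance-latency trade-off from which a deployment target can pick its operating point.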

📝 Abstract
Large Language Models (LLMs) demonstrate exceptional reasoning abilities, enabling strong generalization across diverse tasks such as commonsense reasoning and instruction following. However, as LLMs scale, inference costs become increasingly prohibitive, accumulating significantly over their life cycle. In this paper we consider model compression for LLMs to reduce model size while improving downstream task performance. We phrase this as a neural architecture search problem that automatically prunes structural components, such as attention heads, neurons, and layers, by searching for the Pareto-optimal set of sub-networks balancing between performance and on-device latency. Compared to state-of-the-art structural pruning approaches and fine-tuned smaller sub-networks extracted from the pre-trained model, our method achieves up to 9.85% improvement on average on 11 diverse downstream tasks, while achieving up to 22% improvement in on-device latency.
Problem

Research questions and friction points this paper is trying to address.

Compress Large Language Models efficiently
Optimize model size and task performance
Balance performance and on-device latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated, Pareto-optimal sub-network search
Prunes attention heads, neurons, and layers
Balances task performance and on-device latency