Mosaic: Composite Projection Pruning for Resource-efficient LLMs

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address excessive computational and memory overhead in large language model (LLM) deployment, this paper proposes a fine-grained projection pruning method—the first to synergistically integrate the advantages of unstructured and structured pruning within a hardware-aware composite pruning framework. The method comprises three core components: (1) dynamic parameter importance estimation, (2) joint preservation of critical weights and hardware-friendly structural patterns, and (3) the Mosaic system enabling end-to-end optimization. Experiments on mainstream LLMs demonstrate that the approach prunes 7.19× faster than existing methods and, relative to coarse-grained pruning, achieves up to 84.2% lower perplexity and up to 31.4% higher accuracy; it also cuts inference latency by up to 67% and GPU memory consumption by up to 68%. This work delivers simultaneous gains in pruning efficiency, accuracy retention, and hardware adaptability—establishing a scalable new paradigm for efficient LLM deployment.

📝 Abstract
Extensive compute and memory requirements limit the deployment of large language models (LLMs) on any hardware. Compression methods, such as pruning, can reduce model size, which in turn reduces resource requirements. State-of-the-art pruning is based on coarse-grained methods. They are time-consuming and inherently remove critical model parameters, adversely impacting the quality of the pruned model. This paper introduces projection pruning, a novel fine-grained method for pruning LLMs. In addition, LLM projection pruning is enhanced by a new approach we refer to as composite projection pruning - the synergistic combination of unstructured pruning that retains accuracy and structured pruning that reduces model size. We develop Mosaic, a novel system to create and deploy pruned LLMs using composite projection pruning. Mosaic is evaluated using a range of performance and quality metrics on multiple hardware platforms, LLMs, and datasets. Mosaic is 7.19x faster in producing models than existing approaches. Mosaic models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than models obtained from coarse-grained pruning. Up to 67% faster inference and 68% lower GPU memory use is noted for Mosaic models.
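The composite idea described in the abstract—structured pruning to actually shrink a projection matrix, plus unstructured pruning to add fine-grained sparsity in what remains—can be sketched as follows. This is a minimal illustrative sketch using magnitude/L2-norm importance on a single projection matrix, not the paper's actual algorithm; the function name and parameters (`composite_projection_prune`, `row_keep`, `sparsity`) are assumptions for illustration.

```python
import numpy as np

def composite_projection_prune(W, row_keep=0.75, sparsity=0.5):
    """Hedged sketch of composite pruning for one projection matrix W (out x in).

    Structured step: remove whole output rows (neurons) with the lowest
    L2 norm, shrinking the matrix and hence memory and latency.
    Unstructured step: zero the smallest-magnitude weights inside the
    surviving rows, preserving shape while adding fine-grained sparsity.
    """
    # Structured: keep the top `row_keep` fraction of rows by L2 norm.
    norms = np.linalg.norm(W, axis=1)
    n_keep = max(1, int(round(row_keep * W.shape[0])))
    kept_rows = np.sort(np.argsort(norms)[-n_keep:])
    W_small = W[kept_rows]

    # Unstructured: zero the smallest `sparsity` fraction of surviving weights.
    flat = np.abs(W_small).ravel()
    k = int(sparsity * flat.size)
    if k > 0:
        threshold = np.partition(flat, k - 1)[k - 1]
        W_small = np.where(np.abs(W_small) <= threshold, 0.0, W_small)
    return W_small, kept_rows

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
W_pruned, rows = composite_projection_prune(W, row_keep=0.75, sparsity=0.5)
print(W_pruned.shape)  # two of eight rows removed structurally
```

In a real pipeline the row selection would have to be mirrored in the adjacent layer (dropping the matching input columns), and importance would come from activation or gradient statistics rather than raw weight norms; this sketch only shows how the structured and unstructured stages compose.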
Problem

Research questions and friction points this paper is trying to address.

Reduce LLM resource requirements via pruning
Improve pruning accuracy and efficiency
Enable faster LLM deployment on hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained projection pruning for LLMs
Composite pruning combines unstructured and structured methods
Mosaic system enables efficient model creation and deployment
Bailey J. Eccles
School of Computer Science, University of St Andrews, UK
Leon Wong
Autonomous Networking Research & Innovation Department, Rakuten Mobile, Inc.
Blesson Varghese
Reader in Computer Science, University of St Andrews, UK
Distributed systems · Cloud/Edge computing · Edge intelligence · Distributed machine learning