Mosaic: Composite Projection Pruning for Resource-efficient LLMs

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address excessive computational and memory overhead in large language model (LLM) deployment, this paper proposes a fine-grained projection pruning method—the first to synergistically integrate the advantages of unstructured and structured pruning within a hardware-aware composite pruning framework. The method comprises three core components: (1) dynamic parameter importance estimation, (2) joint preservation of critical weights and hardware-friendly structural patterns, and (3) the Mosaic system enabling end-to-end optimization. Experiments on mainstream LLMs demonstrate that the approach prunes 7.19× faster than existing methods and, relative to coarse-grained pruning, achieves up to 84.2% lower perplexity and up to 31.4% higher accuracy; it also cuts inference latency by up to 67% and GPU memory consumption by up to 68%. This work delivers simultaneous gains in pruning efficiency, accuracy retention, and hardware adaptability—establishing a scalable new paradigm for efficient LLM deployment.

📝 Abstract
Extensive compute and memory requirements limit the deployment of large language models (LLMs) on any hardware. Compression methods, such as pruning, can reduce model size, which in turn reduces resource requirements. State-of-the-art pruning is based on coarse-grained methods. They are time-consuming and inherently remove critical model parameters, adversely impacting the quality of the pruned model. This paper introduces projection pruning, a novel fine-grained method for pruning LLMs. In addition, LLM projection pruning is enhanced by a new approach we refer to as composite projection pruning - the synergistic combination of unstructured pruning that retains accuracy and structured pruning that reduces model size. We develop Mosaic, a novel system to create and deploy pruned LLMs using composite projection pruning. Mosaic is evaluated using a range of performance and quality metrics on multiple hardware platforms, LLMs, and datasets. Mosaic is 7.19x faster in producing models than existing approaches. Mosaic models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than models obtained from coarse-grained pruning. Up to 67% faster inference and 68% lower GPU memory use is noted for Mosaic models.
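The composite idea described in the abstract—structured pruning to actually shrink a projection matrix, plus unstructured pruning to add fine-grained sparsity in what remains—can be sketched as follows. This is a minimal illustrative sketch using magnitude/L2-norm importance on a single projection matrix, not the paper's actual algorithm; the function name and parameters (`composite_projection_prune`, `row_keep`, `sparsity`) are assumptions for illustration.

```python
import numpy as np

def composite_projection_prune(W, row_keep=0.75, sparsity=0.5):
    """Hedged sketch of composite pruning for one projection matrix W (out x in).

    Structured step: remove whole output rows (neurons) with the lowest
    L2 norm, shrinking the matrix and hence memory and latency.
    Unstructured step: zero the smallest-magnitude weights inside the
    surviving rows, preserving shape while adding fine-grained sparsity.
    """
    # Structured: keep the top `row_keep` fraction of rows by L2 norm.
    norms = np.linalg.norm(W, axis=1)
    n_keep = max(1, int(round(row_keep * W.shape[0])))
    kept_rows = np.sort(np.argsort(norms)[-n_keep:])
    W_small = W[kept_rows]

    # Unstructured: zero the smallest `sparsity` fraction of surviving weights.
    flat = np.abs(W_small).ravel()
    k = int(sparsity * flat.size)
    if k > 0:
        threshold = np.partition(flat, k - 1)[k - 1]
        W_small = np.where(np.abs(W_small) <= threshold, 0.0, W_small)
    return W_small, kept_rows

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
W_pruned, rows = composite_projection_prune(W, row_keep=0.75, sparsity=0.5)
print(W_pruned.shape)  # two of eight rows removed structurally
```

In a real pipeline the row selection would have to be mirrored in the adjacent layer (dropping the matching input columns), and importance would come from activation or gradient statistics rather than raw weight norms; this sketch only shows how the structured and unstructured stages compose.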
Problem

Research questions and friction points this paper is trying to address.

Reduce LLM resource requirements via pruning
Improve pruning accuracy and efficiency
Enable faster LLM deployment on hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained projection pruning for LLMs
Composite pruning combines unstructured and structured methods
Mosaic system enables efficient model creation and deployment
Bailey J. Eccles
School of Computer Science, University of St Andrews, UK
Leon Wong
Autonomous Networking Research & Innovation Department, Rakuten Mobile, Inc.
Blesson Varghese
Reader in Computer Science, University of St Andrews, UK
Distributed systems · Cloud/Edge computing · Edge intelligence · Distributed machine learning