Task-Specific Pruning with LLM-Sieve: How Many Parameters Does Your Task Really Need?

📅 2025-05-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the urgent need for parameter-efficient large language models (LLMs) in resource-constrained settings for vertical tasks (e.g., medical question answering, sentiment analysis), this paper proposes LLM-Sieve, a task-aware structured pruning framework. Methodologically, it introduces: (1) a novel task-driven joint linear projection that approximates target-task output behavior; (2) layer-wise differentiated weight matrix pruning via genetic algorithm optimization; and (3) synergistic integration of LoRA fine-tuning and quantization, enabling cross-dataset generalization within the same task. Evaluated across multiple domain-specific benchmarks, LLM-Sieve achieves 20–75% parameter compression with only 1–5% accuracy degradation, significantly improving inference speed and deployment efficiency. The resulting models are smaller, faster, and more accurate, tailored specifically to downstream vertical applications.
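The task-driven joint projection described above can be illustrated with a small linear-algebra sketch. The idea (per the summary, not the paper's exact algorithm) is to replace a weight matrix W with a low-rank factorization chosen to preserve W's outputs on activations X collected from the target task, rather than to approximate W in isolation. One way to do this, assumed here for illustration: take the top right singular vectors of X @ W as the output subspace. The function name and shapes are hypothetical.

```python
import numpy as np

def task_aware_low_rank(W, X, rank):
    """Rank-`rank` factorization W ≈ A @ B minimizing ||X @ W - X @ A @ B||_F,
    where X holds activations gathered on the target task.
    The top right singular vectors of X @ W span the optimal output subspace,
    since projecting X @ W onto them recovers its best rank-r approximation."""
    Y = X @ W                           # task-conditioned outputs, shape (n, d_out)
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    Vr = Vt[:rank].T                    # (d_out, rank): top right singular vectors
    A = W @ Vr                          # (d_in, rank)
    B = Vr.T                            # (rank, d_out)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))       # toy weight matrix
X = rng.standard_normal((256, 64))      # toy "task" activations
A, B = task_aware_low_rank(W, X, rank=8)
err = np.linalg.norm(X @ W - X @ A @ B) / np.linalg.norm(X @ W)
```

Because the factorization is fit to X @ W rather than W alone, directions of W that the task never exercises can be pruned with little effect on task outputs, which is the intuition behind "joint" task-aware projection.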

๐Ÿ“ Abstract
As Large Language Models (LLMs) are increasingly being adopted for narrow tasks - such as medical question answering or sentiment analysis - and deployed in resource-constrained settings, a key question arises: how many parameters does a task actually need? In this work, we present LLM-Sieve, the first comprehensive framework for task-specific pruning of LLMs that achieves 20-75% parameter reduction with only 1-5% accuracy degradation across diverse domains. Unlike prior methods that apply uniform pruning or rely on low-rank approximations of weight matrices or inputs in isolation, LLM-Sieve (i) learns task-aware joint projections to better approximate output behavior, and (ii) employs a Genetic Algorithm to discover differentiated pruning levels for each matrix. LLM-Sieve is fully compatible with LoRA fine-tuning and quantization, and uniquely demonstrates strong generalization across datasets within the same task domain. Together, these results establish a practical and robust mechanism to generate smaller performant task-specific models.
Problem

Research questions and friction points this paper is trying to address.

Determining optimal parameter count for task-specific LLMs
Achieving high parameter reduction with minimal accuracy loss
Developing differentiated pruning for diverse task domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-aware joint projections for output approximation
Genetic Algorithm for differentiated pruning levels
Compatible with LoRA fine-tuning and quantization
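The second innovation, using a genetic algorithm to assign a different pruning level to each weight matrix, can be sketched as a standard GA over vectors of per-matrix keep-ratios. This is a minimal illustrative sketch, not the paper's implementation: the fitness function below is a toy stand-in for the expensive task-accuracy evaluation the real framework would run, and all parameter values are assumptions.

```python
import random

def genetic_search(num_matrices, fitness, pop_size=20, generations=30,
                   levels=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Evolve per-matrix keep-ratios with elitism, one-point crossover,
    and point mutation. `fitness` maps a ratio vector to a score, e.g.
    task accuracy minus a model-size penalty (here a cheap stand-in)."""
    rng = random.Random(seed)
    pop = [[rng.choice(levels) for _ in range(num_matrices)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 4]                 # keep the best quarter
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, num_matrices)     # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                   # point mutation
                child[rng.randrange(num_matrices)] = rng.choice(levels)
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

# Toy fitness: reward pruning, weighted more heavily in later "layers".
# Purely illustrative; a real run would score pruned models on the task.
best = genetic_search(8, lambda r: -sum(v * (i + 1) for i, v in enumerate(r)))
```

The key property this buys, matching the bullet above, is that different matrices end up at different pruning levels: the search can prune aggressively where the task tolerates it and conservatively where it does not.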
Waleed Reda
Microsoft Research
Abhinav Jangda
Microsoft Research
High Performance Computing · Programming Languages · Systems
Krishna Chintalapudi
Microsoft Research