LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models

📅 2026-05-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing unstructured pruning methods for large language models typically optimize each layer separately, which fails to preserve end-to-end performance and degrades accuracy noticeably at high sparsity. This paper proposes LEAP, a learnable end-to-end unstructured pruning framework that models each weight's mask as a Bernoulli variable relaxed with Gumbel-sigmoid, so the mask can be trained directly against the end-to-end objective. Across five LLM families from 0.5B to 8B parameters, at 50% and 60% sparsity, LEAP improves six-task average zero-shot accuracy by 2.59 points on average over ADMM, the best layer-wise baseline in the authors' sweep.
📝 Abstract
Unstructured sparsity is now natively accelerated by recent GPU kernels and dataflow hardware, shifting the bottleneck from inference execution to the pruning algorithm. State-of-the-art methods for unstructured LLM pruning are layer-wise surrogates derived from the Optimal Brain Surgeon principle, and they sacrifice end-to-end accuracy, especially under aggressive sparsity. End-to-end alternatives such as MaskLLM and PATCH show that learnable masks can close this gap, but their categorical-over-patterns parameterization scales with the number of valid masks per row and does not port to the unstructured setting. We introduce LEAP, which replaces this intractable parameterization with a per-weight Bernoulli-via-Gumbel-sigmoid relaxation that makes end-to-end unstructured mask learning tractable. Across five LLM families from 0.5B to 8B parameters at 50% and 60% sparsity, LEAP improves six-task average zero-shot accuracy by +2.59 points on average over ADMM, the best layer-wise baseline in our sweep.
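To make the per-weight Bernoulli-via-Gumbel-sigmoid idea concrete, here is a minimal sketch of the textbook Gumbel-sigmoid (binary Concrete) relaxation in plain Python. This page does not give LEAP's exact formulation, so everything below is an assumption for illustration: the function names, the temperature parameter, the 0.5 threshold for the hard mask, and the expected-sparsity helper are hypothetical and not taken from the paper.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gumbel_sigmoid_sample(logit: float, temperature: float, rng: random.Random) -> float:
    """Draw one relaxed Bernoulli sample in [0, 1] for a single weight.

    The difference of two Gumbel variables is logistic noise, which can be
    sampled as log(u) - log(1 - u) with u ~ Uniform(0, 1).
    """
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return sigmoid((logit + logistic_noise) / temperature)


def sample_mask(logits, temperature, rng):
    """Return (soft, hard) masks with one entry per weight.

    The soft mask is what a gradient-based trainer would differentiate
    through; the hard mask is its thresholded 0/1 version (hypothetical
    choice: threshold at 0.5).
    """
    soft = [gumbel_sigmoid_sample(a, temperature, rng) for a in logits]
    hard = [1 if s > 0.5 else 0 for s in soft]
    return soft, hard


def expected_sparsity(logits) -> float:
    """Expected fraction of pruned weights, since P(keep) = sigmoid(logit)."""
    keep_prob = [sigmoid(a) for a in logits]
    return 1.0 - sum(keep_prob) / len(keep_prob)


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, -2.0, 0.0, 4.0]  # one learnable logit per weight
    soft, hard = sample_mask(logits, temperature=0.5, rng=rng)
    print("soft mask:", [round(s, 3) for s in soft])
    print("hard mask:", hard)
    print("expected sparsity:", round(expected_sparsity(logits), 3))
```

In this sketch each weight carries exactly one logit, so the number of mask parameters grows linearly with the number of weights rather than with the number of valid mask patterns per row. Lower temperatures push the soft mask toward 0 or 1, and the probability that the hard mask keeps a weight equals the sigmoid of its logit regardless of temperature.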
Problem

Research questions and friction points this paper is trying to address.

unstructured pruning
large language models
end-to-end accuracy
sparsity
learnable masks
Innovation

Methods, ideas, or system contributions that make the work stand out.

unstructured pruning
end-to-end learning
Gumbel-sigmoid relaxation
large language models
learnable masks