SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

📅 2026-05-07
🤖 AI Summary
This work addresses the accuracy degradation that post-training semi-structured pruning of large language models commonly suffers because of strong structural coupling. Existing methods recover accuracy through large-scale sparse retraining, which is computationally expensive. The authors propose SparseForge, a post-training framework that optimizes the sparsity mask directly instead of scaling up the number of retraining tokens. It combines Hessian-aware importance estimation with progressive annealing of soft masks into hardware-friendly 2:4 sparse patterns. On LLaMA-2-7B under 2:4 sparsity, SparseForge reaches 57.27% average zero-shot accuracy with only 5 billion retraining tokens, surpassing the dense model's 56.43% and approaching the 57.52% that a state-of-the-art method obtains with 40 billion tokens. That is a 0.25-point accuracy gap at one eighth of the retraining tokens.
📝 Abstract
Semi-structured sparsity provides a practical path to accelerate large language models (LLMs) with native hardware support, but post-training semi-structured pruning often suffers from substantial quality degradation due to strong structural coupling. Existing methods rely on large-scale sparse retraining to recover accuracy, resulting in high computational cost. We propose SparseForge, a post-training framework that improves recovery efficiency by directly optimizing the sparsity mask rather than scaling up retraining tokens. SparseForge combines Hessian-aware importance estimation with progressive annealing of soft masks into hardware-executable structured sparsity, enabling stable and efficient sparse recovery. On LLaMA-2-7B under 2:4 sparsity, SparseForge achieves 57.27% average zero-shot accuracy with only 5B retraining tokens, surpassing the dense model's 56.43% accuracy and approaching the 57.52% result of a state-of-the-art method using 40B tokens. These improvements in the accuracy-efficiency trade-off are shown to be consistent across model families.
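To make the terms in the abstract concrete, the following is a minimal sketch of a 2:4 mask and a temperature-annealed soft relaxation of it. It is not the SparseForge algorithm, which this page does not specify. The diagonal-Hessian importance proxy, the sigmoid relaxation, the temperature values, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def hessian_diag_importance(weights, calib_inputs):
    """Assumed proxy: score = w^2 * diag(H), with H approximated by the
    mean of x^2 over calibration inputs (layer-wise least-squares Hessian)."""
    h_diag = (calib_inputs ** 2).mean(axis=0)      # shape (in_features,)
    return weights ** 2 * h_diag[None, :]          # shape (out_features, in_features)

def hard_mask_2_4(scores):
    """2:4 pattern: keep the 2 highest-scoring weights in every
    contiguous group of 4 along the input dimension."""
    out_f, in_f = scores.shape
    assert in_f % 4 == 0, "input dimension must be a multiple of 4"
    groups = scores.reshape(out_f, in_f // 4, 4)
    top2 = np.argsort(groups, axis=-1)[..., 2:]    # indices of the two largest
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, top2, 1.0, axis=-1)
    return mask.reshape(out_f, in_f)

def soft_mask_2_4(scores, temperature):
    """Assumed relaxation: sigmoid of the distance to a per-group threshold
    placed midway between the 2nd and 3rd largest scores. Lowering the
    temperature pushes the soft mask toward the hard 2:4 mask."""
    out_f, in_f = scores.shape
    groups = scores.reshape(out_f, in_f // 4, 4)
    ordered = np.sort(groups, axis=-1)             # ascending within each group
    threshold = 0.5 * (ordered[..., 1:2] + ordered[..., 2:3])
    z = np.clip((groups - threshold) / temperature, -60.0, 60.0)
    return (1.0 / (1.0 + np.exp(-z))).reshape(out_f, in_f)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 16))                   # toy weight matrix
    X = rng.normal(size=(32, 16))                  # toy calibration activations
    scores = hessian_diag_importance(W, X)
    hard = hard_mask_2_4(scores)
    # Hypothetical annealing schedule: the gap to the hard mask should shrink.
    for t in (1.0, 0.1, 0.01, 1e-4):
        gap = np.abs(soft_mask_2_4(scores, t) - hard).max()
        print(f"temperature={t:g}  max |soft - hard| = {gap:.4f}")
```

In a recovery loop of this shape, the weights would be multiplied by the soft mask while the temperature decreases, and the final hard mask is what 2:4-capable hardware executes. How SparseForge schedules the annealing and estimates the Hessian is described in the paper itself, not here.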
Problem

Research questions and friction points this paper is trying to address.

semi-structured sparsity
post-training pruning
large language models
accuracy degradation
computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

semi-structured sparsity
Hessian-guided pruning
soft-mask annealing
post-training sparsification
LLM acceleration