SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of NVIDIA's 2:4 Sparse Tensor Cores, which enforce a rigid 50% sparsity constraint that significantly degrades the accuracy of large language models, while more flexible (2N−2):2N sparsity patterns, though accuracy-preserving, lack hardware acceleration. To bridge this gap, the authors propose a sliding window decomposition technique that losslessly restructures arbitrary (2N−2):2N sparse weights into overlapping 2:4-compatible windows. Combined with an activation lifting strategy that fuses the required activation reordering into per-token quantization, this approach enables the first lossless Tensor Core acceleration of (2N−2):2N sparse models on commodity GPUs. Experiments demonstrate a measured 1.33× speedup, approaching the theoretical limit of 4/3, on models such as Qwen2.5-7B, with verified accuracy preservation and efficient inference across multiple GPU generations, including A100, H100, and B200.
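Where the 4/3 figure comes from, as a back-of-envelope check using only quantities stated above: decomposing one block of $2N$ weight elements into $N-1$ overlapping 2:4 windows means each window runs as a 4-wide fragment at the Sparse Tensor Core's 2× rate, i.e. at the cost of 2 dense elements, so

$$\text{speedup} \le \frac{\text{dense cost}}{\text{sparse cost}} = \frac{2N}{2(N-1)} = \frac{N}{N-1}, \qquad N = 4 \;(6{:}8) \;\Rightarrow\; \tfrac{4}{3} \approx 1.33.$$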

📝 Abstract
NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning -- a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder $(2N-2):2N$ patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, falling back to dense execution without any benefit from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into $N-1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across various GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX Spark), precisions (FP4, INT8, FP8, BF16, FP16), and model families (Llama, Qwen, BitNet). On compute-bound workloads, the measured speedup ratio (1.33x) approaches the theoretical upper-bound $N/(N-1)=4/3$ at 6:8 weight sparsity in Qwen2.5-7B, establishing $(2N-2):2N$ as a practical path to accuracy-preserving LLM acceleration. Code available at https://github.com/bcacdwk/vllmbench.
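To make the Sliding Window Decomposition concrete, the sketch below routes the nonzeros of a single $(2N-2):2N$ block into $N-1$ overlapping length-4 windows (window $w$ covering positions $[2w, 2w+4)$) so that each window is 2:4-compliant and the windows sum back to the original block exactly. This is a minimal NumPy illustration only: the greedy leftmost-window routing rule and the names `slide_decompose` and `reconstruct` are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def slide_decompose(block: np.ndarray, n: int = 4) -> np.ndarray:
    """Split a (2N-2):2N block of length 2N into N-1 overlapping
    2:4-compliant windows; window w covers positions [2w, 2w+4).
    Greedy leftmost-window routing (an assumption, not necessarily
    the paper's rule). Returns an (N-1, 4) array of window contents."""
    assert block.size == 2 * n and np.count_nonzero(block) <= 2 * n - 2
    windows = np.zeros((n - 1, 4), dtype=block.dtype)
    filled = np.zeros(n - 1, dtype=int)           # nonzeros placed per window
    for pos in np.flatnonzero(block):
        lo = max(0, (pos - 2) // 2)               # leftmost window covering pos
        hi = min(n - 2, pos // 2)                 # rightmost window covering pos
        for w in range(lo, hi + 1):
            if filled[w] < 2:                     # 2:4 budget not yet exhausted
                windows[w, pos - 2 * w] = block[pos]
                filled[w] += 1
                break
        else:
            raise ValueError("no window has room: block is not (2N-2):2N")
    return windows

def reconstruct(windows: np.ndarray, n: int = 4) -> np.ndarray:
    """Scatter-add each window back at its offset; lossless by construction,
    since every nonzero was routed to exactly one window."""
    block = np.zeros(2 * n, dtype=windows.dtype)
    for w, win in enumerate(windows):
        block[2 * w : 2 * w + 4] += win
    return block

# 6:8 example (N=4): six nonzeros, zeros at positions 5 and 7.
blk = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 6.0, 0.0])
wins = slide_decompose(blk)
assert np.array_equal(reconstruct(wins), blk)       # lossless round trip
assert all(np.count_nonzero(w) <= 2 for w in wins)  # every window is 2:4
print(wins)  # -> [[1 2 0 0], [3 4 0 0], [5 0 6 0]]
```

Note that the window capacities total $2(N-1) = 2N-2$, exactly the maximum nonzero count a $(2N-2):2N$ block allows, so the routing budget is tight: every valid block fits, with no slack left over.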
Problem

Research questions and friction points this paper is trying to address.

structured sparsity
LLM acceleration
Sparse Tensor Cores
model pruning
accuracy preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Sparsity
Sparse Tensor Cores
Sliding Window Decomposition
Activation Lifting
LLM Acceleration
Authors

Hanyong Shao (Peking University, Microsoft Research)
Yingbo Hao (South China University of Technology, Microsoft Research)
Ting Song (Microsoft Research)
Yan Xia (Microsoft Research Asia)
Di Zhang (Peking University, Microsoft Research)
Shaohan Huang (Microsoft Research Asia)
Xun Wu (Microsoft Research Asia)
Songchen Xu (Microsoft Research, Shanghai Jiao Tong University)
Le Xu (Microsoft Research, The Hong Kong University of Science and Technology)
Li Dong (Microsoft Research)
Zewen Chi (Microsoft Research)
Yi Zou (Intel Labs)
Furu Wei (Distinguished Scientist, Microsoft Research)