PATCH: Learnable Tile-level Hybrid Sparsity for LLMs

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) faces challenges of high memory and computational overhead, while conventional pruning methods struggle to balance accuracy preservation and hardware acceleration: unstructured sparsity impedes GPU efficiency, and fixed 2:4 structured sparsity degrades model quality. This paper proposes PATCH, the first learnable tile-level hybrid sparsity framework, which employs trainable masks to dynamically select either dense or 2:4-sparse blocks within weight matrices, enabling continuous sparsity ratios from 0% to 50% and layer-wise non-uniform sparsity. Evaluated on LLaMA-2 7B, PATCH achieves 1.18x-1.38x end-to-end inference speedup over dense baselines, with accuracy improvements of 0.37%-2.96% over MaskLLM. Moreover, it significantly narrows the performance gap with dense models across model scales from 0.5B to 8B parameters.

📝 Abstract
Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an effective way to reduce these overheads, yet existing approaches face challenges: unstructured sparsity, where nonzeros can appear anywhere, preserves accuracy but yields irregular access patterns that prevent GPU acceleration, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality. To bridge this gap, we introduce PATCH, a hybrid sparsity framework that enables a continuous sparsity ratio between 0% and 50%. PATCH partitions weight matrices into tiles, assigning each tile to be either dense or 2:4 sparse via a learnable mask selection mechanism. This design provides fine-grained control over accuracy-acceleration tradeoffs and supports non-uniform sparsity across layers, leading to superior overall quality. Across models from 0.5B to 8B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. For instance, on LLaMA-2 7B with an A6000 GPU, PATCH achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to the state-of-the-art 2:4 pruning method, MaskLLM.
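The abstract's core mechanism, partitioning a weight matrix into tiles and making each tile either dense or 2:4 sparse, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the tile size, the magnitude-based 2:4 pruning rule, and the `sparse_tiles` selection map below are illustrative assumptions.

```python
import numpy as np

def mask_2to4(tile):
    """Zero out the 2 smallest-magnitude weights in every contiguous
    group of 4 (the hardware-friendly 2:4 semi-structured pattern)."""
    flat = tile.reshape(-1, 4)
    # indices of the 2 largest-magnitude entries in each group of 4
    keep = np.argsort(np.abs(flat), axis=1)[:, 2:]
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (flat * mask).reshape(tile.shape)

def apply_hybrid_sparsity(W, tile_size, sparse_tiles):
    """Apply 2:4 sparsity only to tiles flagged in the boolean map
    `sparse_tiles`; leave the remaining tiles dense."""
    out = W.copy()
    for i in range(0, W.shape[0], tile_size):
        for j in range(0, W.shape[1], tile_size):
            if sparse_tiles[i // tile_size, j // tile_size]:
                out[i:i+tile_size, j:j+tile_size] = mask_2to4(
                    W[i:i+tile_size, j:j+tile_size])
    return out

# Example: 8x8 matrix, 4x4 tiles, two of the four tiles made 2:4 sparse.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
sparse_tiles = np.array([[True, False],
                         [False, True]])
W_sparse = apply_hybrid_sparsity(W, 4, sparse_tiles)
# Half the tiles are 2:4 sparse (50% zeros), half dense -> 25% overall,
# one point on the continuous 0%-50% sparsity range the paper describes.
print((W_sparse == 0).mean())
```

Because the sparsity ratio is set by the fraction of tiles flagged sparse, it can take any value between 0% (all dense) and 50% (all 2:4), which is the accuracy-acceleration dial the abstract refers to.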
Problem

Research questions and friction points this paper is trying to address.

Reducing memory and compute costs of large language models
Bridging unstructured and semi-structured sparsity limitations
Enabling flexible accuracy-acceleration tradeoffs via hybrid sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid sparsity framework with continuous sparsity ratio
Learnable tile-level dense or 2:4 sparse mask selection
Non-uniform sparsity across layers for accuracy-speed tradeoffs
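The learnable dense-vs-2:4 selection above can be sketched as a relaxed per-tile choice. The sigmoid blending below is an illustrative assumption (the paper's actual trainable-mask mechanism is not specified here): during mask learning each tile is a convex combination of its dense and 2:4-sparse versions, so gradients on a per-tile logit can steer the choice, which is then hardened at deployment.

```python
import numpy as np

def mask_2to4(tile):
    """2:4 pattern: keep the 2 largest-magnitude weights per group of 4."""
    flat = tile.reshape(-1, 4)
    keep = np.argsort(np.abs(flat), axis=1)[:, 2:]
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (flat * mask).reshape(tile.shape)

def relaxed_tile(tile, logit):
    """Illustrative relaxation: blend dense and 2:4-sparse versions of a
    tile with weight sigmoid(logit). A trainable `logit` per tile lets
    gradient descent decide which tiles can afford to be sparse."""
    p = 1.0 / (1.0 + np.exp(-logit))  # probability of the sparse branch
    return p * mask_2to4(tile) + (1.0 - p) * tile

rng = np.random.default_rng(1)
tile = rng.normal(size=(4, 4))
near_sparse = relaxed_tile(tile, 10.0)   # logit >> 0: effectively 2:4 sparse
near_dense = relaxed_tile(tile, -10.0)   # logit << 0: effectively dense
```

Because each tile carries its own logit, layers (and regions within layers) that are more sensitive to pruning naturally end up with more dense tiles, yielding the non-uniform, layer-wise sparsity the bullets describe.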