🤖 AI Summary
In large-batch LLM inference, contextual sparsity collapses because the union of jointly active neurons densifies rapidly, degrading throughput and increasing latency. Method: This paper proposes Polar Sparsity—a mechanism that (i) identifies a key invariant: attention-head sparsity remains stable across batch sizes, motivating a shift of the sparsity focus from MLP to attention layers; (ii) introduces the first practical sparsification scheme for large-batch contextual sparsity; and (iii) develops hardware-friendly, sparsity-aware GPU kernels supporting selective, dynamic sparsity in both attention and MLP layers—compatible with OPT, LLaMA-2/3, and other mainstream architectures. Contributions/Results: The approach achieves up to 2.2× end-to-end inference speedup across batch sizes and sequence lengths without compromising accuracy, and the code is open-sourced.
📝 Abstract
Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to Attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly expensive at scale, while its head sparsity remains stable and batch-invariant. We develop hardware-efficient, sparsity-aware GPU kernels for selective MLP and Attention computations, delivering up to $2.2\times$ end-to-end speedups for models like OPT and LLaMA-2/3, across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes, making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems. Our code is available at: https://github.com/susavlsh10/Polar-Sparsity.
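The abstract's central observation—that per-token MLP sparsity vanishes under batching because the union of active neurons densifies—can be illustrated with a small simulation. This sketch is not from the paper: it assumes each token independently activates a random 5% of a hypothetical 4096-neuron MLP layer (the neuron count and activation fraction are illustrative assumptions), and measures what fraction of neurons the batched union touches. Under independence, coverage behaves like $1 - (1 - s)^B$ for sparsity level $s$ and batch size $B$, approaching dense computation quickly.

```python
import random

random.seed(0)

NEURONS = 4096       # hypothetical MLP hidden width (assumption)
ACTIVE_FRAC = 0.05   # per-token contextual sparsity level (assumption)
k = int(NEURONS * ACTIVE_FRAC)

def union_coverage(batch_size: int) -> float:
    """Fraction of neurons in the union of every token's active set.

    A batched sparse MLP kernel must compute every neuron that any
    token in the batch activates, so this union is what matters.
    """
    active = set()
    for _ in range(batch_size):
        active.update(random.sample(range(NEURONS), k))
    return len(active) / NEURONS

for b in (1, 8, 64, 256):
    print(f"batch={b:>3}  union coverage={union_coverage(b):.3f}")
```

At batch size 1 the kernel touches only ~5% of neurons, but by batch 64 the union covers well over 90% of the layer, which is why the paper argues MLP sparsity stops paying off at scale. Attention-head selection, by contrast, is claimed to be batch-invariant, since each sequence's heads can be skipped independently rather than unioned.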