🤖 AI Summary
Large language models face significant computational and memory overhead from dense attention during long-context inference, yet whether such density is necessary remains unclear. This work argues for principled, extreme sparsity along the context dimension during inference and presents a systematic study showing that sparsity is not merely a heuristic trick but a viable foundation for model inference, training, and architecture design. In experiments spanning 20 models across five families, current models, though not trained for context sparsity, remain robust to inference-time decode sparsity on diverse tasks, including retrieval, multi-hop question answering, mathematical reasoning, and agentic coding. Custom sparse decode kernels achieve up to 10× speedup over FlashInfer at 50× sparsity on H100 GPUs.
📝 Abstract
Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift toward longer contexts and agentic interactions, the compute and memory bottlenecks of attention become increasingly critical, raising the question of whether these constraints are fundamental. Our position is that these constraints are artificial and unnecessary, and that the future of LLM inference lies in extreme but principled sparsity along the context dimension. This position is supported by several strands of empirical and theoretical evidence. First, we find the insistence on dense attention unreasonable, since in a long context a query effectively projects O(N) attention information into a hidden space of dimension d << N, making the process inherently lossy. Second, we perform an extensive study of sparsity in LLMs spanning 20 models across five model families, varying context lengths, and different sparsity levels. We empirically demonstrate a strong trend: current LLMs, despite not being trained for context sparsity, are remarkably robust to inference-time decode sparsity across tasks of varying complexity, including retrieval, multi-hop QA, mathematical reasoning, and agentic coding. Importantly, we also show that current hardware is already sufficient to realize substantial gains from this sparsity. For example, our sparse decode kernels accelerate large-context processing by up to 10x over FlashInfer at 50x sparsity levels on hardware such as the H100. Overall, these results position extreme context sparsity not as a heuristic, but as a principled foundation for LLM inference, training, and architecture design: one that is both feasible and beneficial, and a compelling direction for future systems.
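To make "decode sparsity" concrete, the following is a minimal NumPy sketch of a single decode step in which the query attends to only N / s of the N cached tokens, where s is the sparsity level. It is an illustration under stated assumptions, not the paper's method or kernel: the abstract does not say how tokens are selected, so the sketch uses exact top-k selection by query-key score as a stand-in, and the function names and the `sparsity` parameter are hypothetical. Because this stand-in computes every score before selecting, it saves no work by itself; it only shows what output a sparse decode step approximates.

```python
import numpy as np


def dense_decode_attention(q, K, V):
    """Single-query (decode-step) attention over all N cached tokens.

    q: (d,) query, K: (N, d) cached keys, V: (N, d) cached values.
    """
    scores = K @ q / np.sqrt(q.shape[0])       # (N,) scaled dot products
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                         # (d,) attention output


def sparse_decode_attention(q, K, V, sparsity):
    """Attend only to the N // sparsity highest-scoring cached tokens.

    Assumption: exact top-k selection stands in for whatever selection
    rule a real sparse-attention system would use.
    """
    n = K.shape[0]
    keep = max(1, n // sparsity)               # number of tokens retained
    scores = K @ q / np.sqrt(q.shape[0])
    top = np.argpartition(scores, -keep)[-keep:]  # indices of the top-k scores
    kept = scores[top]
    weights = np.exp(kept - kept.max())        # softmax over kept tokens only
    weights /= weights.sum()
    return weights @ V[top]
```

With `sparsity=1` the sparse function keeps every token and matches dense attention exactly. At `sparsity=50` it reads the values of only 2% of the cache, and the result stays close to the dense output whenever the softmax mass is concentrated on a few tokens.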