SOCKET: SOft Collision Kernel EsTimator for Sparse Attention

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of attention in long-context reasoning, where existing sparse attention methods rely on rigid hash-based matching that struggles to balance efficiency and accuracy. The authors propose SOCKET, the first approach to reformulate locality-sensitive hashing (LSH) as a differentiable, similarity-aware soft collision kernel estimator. By aggregating graded evidence across multiple hash tables, SOCKET enables stable and efficient selection of top-k critical tokens without heuristic voting, grounded in rigorous mathematical foundations. The method integrates a custom CUDA scoring kernel with a FlashDecode Triton-based sparse attention backend. Experiments demonstrate that SOCKET outperforms state-of-the-art sparse attention techniques across multiple long-context benchmarks, achieving up to 1.5× the throughput of FlashAttention.

📝 Abstract
Exploiting sparsity during long-context inference is central to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends critically on efficient scoring and selection of relevant tokens at inference time. We revisit Locality-Sensitive Hashing (LSH) as a sparsification primitive and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation. Our key insight is that hard LSH produces discrete collision signals and is therefore poorly suited for ranking. In contrast, soft LSH aggregates graded collision evidence across hash tables, preserving the stability of relative ordering among the true top-$k$ tokens. This transformation elevates LSH from a candidate-generation heuristic to a principled and mathematically grounded scoring kernel for sparse attention. Leveraging this property, SOCKET enables efficient token selection without an ad-hoc voting mechanism, and matches or surpasses established sparse attention baselines across multiple long-context benchmarks using a diverse set of models. With a custom CUDA kernel for scoring keys and a Flash Decode Triton backend for sparse attention, SOCKET achieves up to 1.5$\times$ higher throughput than FlashAttention, making it an effective tool for long-context inference. Code is open-sourced at https://github.com/amarka8/SOCKET.
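The hard-vs-soft distinction the abstract draws can be illustrated with a small NumPy sketch. This is not the authors' CUDA implementation; it is a hypothetical SimHash-based toy in which "hard" scoring counts all-or-nothing bucket collisions per table, while "soft" scoring aggregates the graded fraction of matching hash bits across tables (a quantity monotone in cosine similarity), then selects top-$k$ keys by the soft score. All names and parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L, b, k = 64, 1000, 8, 16, 32   # dim, #keys, #tables, bits/table, top-k

keys = rng.standard_normal((n, d))    # cached key vectors (stand-in for KV cache)
query = rng.standard_normal(d)        # current decode-step query

# L independent SimHash tables, each defined by b random hyperplanes
planes = rng.standard_normal((L, b, d))
key_bits = np.einsum('lbd,nd->lnb', planes, keys) > 0   # (L, n, b) sign bits
q_bits = (planes @ query > 0)                           # (L, b)

matches = key_bits == q_bits[:, None, :]                # (L, n, b) per-bit agreement

# Hard LSH: a key scores only when ALL b bits match in a table (discrete votes)
hard = matches.all(-1).sum(0)                           # (n,) integers in [0, L]

# Soft collision kernel (sketch): graded evidence = matching-bit fraction per
# table, averaged over tables; preserves relative ordering far more stably
soft = matches.mean(-1).mean(0)                         # (n,) reals in [0, 1]

topk = np.argsort(-soft)[:k]          # top-k candidate tokens for sparse attention
```

The discreteness problem is visible directly: with random Gaussian keys, `hard` is almost entirely zeros (ties everywhere, so ranking is impossible), while `soft` gives every key a distinct graded score.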
Problem

Research questions and friction points this paper is trying to address.

sparse attention
long-context inference
token selection
attention sparsification
efficient scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Attention
Locality-Sensitive Hashing
Soft Collision Kernel
Long-Context Inference
Efficient Token Selection
Sahil Joshi
Department of Computer Science, Rice University, TX, USA
Agniva Chowdhury
Department of Computer Science, Rice University, TX, USA
Wyatt Bellinger
Department of Computer Science, Rice University, TX, USA
Amar Kanakamedala
Department of Computer Science, Rice University, TX, USA
Ekam Singh
Department of Computer Science, Rice University, TX, USA
Hoang Anh Duy Le
Department of Computer Science, Rice University, TX, USA
Aditya Desai
University of California, Berkeley
Machine learning efficiency, hashing, sketching, sampling
Anshumali Shrivastava
Rice University, ThirdAI Corp.
Machine Learning, Large Scale Deep Learning, Information Retrieval