Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the substantial computational and memory overhead incurred by large language models during long-context reasoning due to their self-attention mechanism. The authors propose Sketch&Walk, a training-free dynamic sparse attention method that approximates attention scores using lightweight Hadamard sketches and dynamically selects critical attention blocks through a cross-layer deterministic walk mechanism. Notably, Sketch&Walk applies uniformly to both the prefill and decode phases with a single algorithm. It achieves near-lossless accuracy at only 20% attention density, sometimes even outperforming dense attention, and accelerates inference by up to 6×.
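As a rough, unofficial illustration of the sketching idea (the page does not include the authors' code, and every name below is hypothetical), attention scores can be approximated by compressing queries and keys with a subsampled randomized Hadamard transform before the dot product:

```python
# Illustrative only: approximate attention scores with a subsampled
# randomized Hadamard transform (SRHT). This is a generic sketch of the
# idea, not the paper's implementation; all names here are hypothetical.
import numpy as np
from scipy.linalg import hadamard

def make_hadamard_sketch(d, m, rng):
    """Build a d x m SRHT matrix S with E[S S^T] = I_d (d a power of two)."""
    signs = rng.choice([-1.0, 1.0], size=d)        # random sign flips
    H = hadamard(d).astype(np.float64)             # +/-1 Hadamard matrix
    cols = rng.choice(d, size=m, replace=False)    # subsample m columns
    return (signs[:, None] * H[:, cols]) / np.sqrt(m)

rng = np.random.default_rng(0)
n, d, m = 1024, 128, 16                            # m << d keeps it cheap
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

S = make_hadamard_sketch(d, m, rng)
approx = (Q @ S) @ (K @ S).T / np.sqrt(d)          # coarse score estimate
exact = Q @ K.T / np.sqrt(d)                       # dense reference scores
print("mean abs error:", float(np.abs(approx - exact).mean()))
```

The appeal of the Hadamard structure is that the projection is cheap to apply and roughly preserves inner products, so the coarse score estimate costs on the order of n²·m with m ≪ d, rather than n²·d for exact scores.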

📝 Abstract
Self-attention dominates the computational and memory cost of long-context LLM inference across both the prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity with lightweight sketches and a deterministic walk. Sketch&Walk applies Hadamard sketching to obtain inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct interactions between tokens. The accumulated walk scores are used to select top-k attention blocks, enabling dynamic sparsity with a single training-free algorithm that applies uniformly to both the prefill and decode phases, supported by custom sparse attention kernels. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density and can slightly outperform dense attention in some settings, while achieving up to 6× inference speedup.
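The abstract does not detail the cross-layer walk here, but assuming some accumulated per-token score matrix is available, the final top-k block-selection step at roughly 20% density could be sketched as follows (a minimal stand-in, not the paper's kernels; the helper name and block size are assumptions):

```python
# Illustrative only: given an (n x n) score estimate (e.g. accumulated walk
# scores), pool it into block tiles and keep the top-k key blocks per query
# block at a target density. Hypothetical helper, not the paper's kernels.
import numpy as np

def topk_block_mask(scores, block=64, density=0.2):
    n = scores.shape[0]
    nb = n // block                                  # blocks per axis
    tiles = scores[:nb * block, :nb * block]
    pooled = tiles.reshape(nb, block, nb, block).mean(axis=(1, 3))
    k = max(1, int(round(density * nb)))             # blocks kept per row
    keep = np.argsort(pooled, axis=1)[:, -k:]        # top-k key blocks
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask                                      # True = compute this block

rng = np.random.default_rng(0)
walk_scores = rng.standard_normal((1024, 1024))      # stand-in for real scores
mask = topk_block_mask(walk_scores, block=64, density=0.2)
print("attention density:", float(mask.mean()))      # ~0.2 by construction
```

In the actual method a dense score matrix would presumably never be materialized; the block scores would come from the cheap sketched estimates, which is what keeps selection inexpensive enough to run at every layer.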
Problem

Research questions and friction points this paper is trying to address.

self-attention
LLM inference
computational cost
memory cost
long-context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sketch&Walk Attention
sparse attention
Hadamard sketching
training-free
long-context LLM inference
Authors
Hoang Anh Duy Le
Department of Computer Science, Rice University
Sahil Joshi
Department of Computer Science, Rice University
Zeyu Yang
Department of Computer Science, Rice University
Zhaozhuo Xu
Stevens Institute of Technology
Machine Learning, Nearest Neighbor Search
Anshumali Shrivastava
Rice University, ThirdAI Corp.
Machine Learning, Large Scale Deep Learning, Information Retrieval