🤖 AI Summary
This work addresses two challenges that large language models face during long-context reasoning: the high computational complexity of attention and the substantial memory overhead of key-value (KV) caching. The authors propose a hierarchical sparse attention mechanism that integrates block-level coarse-grained filtering with token-level fine-grained selection. Complementing this, they design an asynchronous KV cache offloading engine that exploits temporal locality to overlap cache transfers with computation. The approach is compatible with both Grouped-Query Attention (GQA) and Multi-head Latent Attention (MLA) architectures. Evaluated across context lengths of 48k to 96k tokens, the method achieves accuracy on par with full attention while delivering operator-level speedups of 1.2×–10.0× and end-to-end throughput improvements of 1.3×–4.7×, significantly reducing both latency and memory consumption.
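The overlap of cache transfers with computation described above amounts to a double-buffering pipeline: while the attention kernel processes the current KV block, the next block is fetched in the background. A minimal sketch of this pattern, with `fetch_block` and `attend_block` as hypothetical stand-ins for the actual transfer and compute steps (neither name comes from the paper):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_block(i):
    # Stand-in for copying block i of the offloaded KV cache back to the device.
    return [i] * 4

def attend_block(block):
    # Stand-in for the attention computation over one KV block.
    return sum(block)

def pipelined_attention(n_blocks):
    """Double-buffered loop: transfer of block i+1 overlaps compute on block i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch_block, 0)          # start the first transfer
        for i in range(n_blocks):
            block = pending.result()                 # wait for block i to arrive
            if i + 1 < n_blocks:
                pending = io.submit(fetch_block, i + 1)  # prefetch the next block...
            results.append(attend_block(block))          # ...while computing this one
    return results

out = pipelined_attention(3)  # → [0, 4, 8]
```

In a real system the transfer would be an asynchronous host-to-device copy on a dedicated stream rather than a Python thread, but the scheduling logic is the same: the fetch for the next block is issued before the compute for the current one begins, so transfer latency is hidden whenever compute time dominates.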
📝 Abstract
Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation by exploiting temporal locality. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2×–10.0× operator speedups and 1.3×–4.7× end-to-end throughput improvements on 48k–96k contexts.
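The hierarchical selection can be pictured as a two-stage top-k: first score whole KV blocks cheaply (e.g. against a pooled key per block) and keep only the best blocks, then score individual tokens inside the survivors. A small NumPy sketch of that idea, assuming mean-pooled block keys as the coarse score (the paper's exact scoring functions, block size, and budgets are not given here, so all of those are illustrative):

```python
import numpy as np

def hierarchical_sparse_select(q, K, block_size=4, top_blocks=2, top_tokens=4):
    """Stage 1: coarse block filter; Stage 2: fine token selection.

    q: (d,) query vector; K: (n, d) cached keys, n divisible by block_size.
    Returns sorted indices of the tokens the query would attend to.
    """
    n, d = K.shape
    n_blocks = n // block_size
    # Stage 1: score each block via its mean-pooled key (cheap coarse filter).
    pooled = K.reshape(n_blocks, block_size, d).mean(axis=1)     # (n_blocks, d)
    block_scores = pooled @ q
    keep = np.argsort(block_scores)[-top_blocks:]                # best blocks
    # Stage 2: exact token scores, computed only inside surviving blocks.
    cand = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in keep]
    )
    token_scores = K[cand] @ q
    sel = cand[np.argsort(token_scores)[-top_tokens:]]
    return np.sort(sel)

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
idx = hierarchical_sparse_select(q, K)   # token indices to attend to
```

The efficiency argument is visible in the sketch: the full token-score matrix is never built; stage 1 touches one pooled vector per block, and stage 2 touches only `top_blocks * block_size` keys, while stage 2's exact per-token scores recover the precision that a purely block-level method gives up.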