AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two challenges that large language models face during long-context reasoning: the high computational complexity of attention and the substantial memory overhead of key-value (KV) caching. The authors propose a hierarchical sparse attention mechanism that integrates block-level coarse-grained filtering with token-level fine-grained selection. Complementing this, they design an asynchronous KV cache offloading engine that exploits temporal locality to overlap cache transfers with computation. The approach is compatible with both Grouped-Query Attention (GQA) and Multi-head Latent Attention (MLA) architectures. Evaluated across context lengths of 48k to 96k tokens, the method achieves accuracy on par with full attention while delivering operator-level speedups of 1.2×–10.0× and end-to-end throughput improvements of 1.3×–4.7×, significantly reducing both latency and memory consumption.
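The two-level scheme described above can be illustrated with a small sketch: score KV blocks coarsely against the query (here via mean-pooled block keys), keep the top blocks, then do fine-grained token top-k only within the surviving blocks before the sparse softmax. This is an illustrative approximation under assumed design choices (mean-pooled block scoring, fixed budgets), not the paper's actual kernel.

```python
import numpy as np

def two_level_sparse_attention(q, K, V, block_size=4, top_blocks=2, top_tokens=4):
    """Sketch of hierarchical sparse attention for one query vector.

    q: (d,) query; K, V: (n, d) KV cache with n divisible by block_size.
    Coarse stage prunes whole blocks; fine stage picks tokens inside them.
    """
    n, d = K.shape
    nb = n // block_size
    # Coarse stage: score each block by the query's similarity to its mean key.
    block_keys = K.reshape(nb, block_size, d).mean(axis=1)        # (nb, d)
    block_scores = block_keys @ q                                  # (nb,)
    keep_blocks = np.argsort(block_scores)[-top_blocks:]
    # Fine stage: token-level top-k restricted to the surviving blocks.
    cand = np.concatenate([np.arange(b * block_size, (b + 1) * block_size)
                           for b in keep_blocks])
    tok_scores = K[cand] @ q
    keep = cand[np.argsort(tok_scores)[-top_tokens:]]
    # Sparse softmax attention over the selected tokens only.
    w = np.exp(K[keep] @ q / np.sqrt(d))
    w /= w.sum()
    return w @ V[keep]
```

With the budgets set to cover the whole cache, the output coincides with full attention; shrinking `top_blocks` and `top_tokens` trades accuracy for compute, which is the balance the paper targets.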
📝 Abstract
Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation by exploiting temporal locality. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2×–10.0× operator speedups and 1.3×–4.7× end-to-end throughput improvements on 48k–96k contexts.
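The asynchronous offloading idea in the abstract, overlapping KV cache transfers with attention computation, can be sketched as a producer/consumer pipeline: a background thread prefetches the KV data needed by the next step while the current step computes. The function and callback names below are illustrative, not the paper's API, and a thread with a bounded queue stands in for real host-to-device copy streams.

```python
import queue
import threading

def overlapped_decode(steps, fetch_kv, compute):
    """Sketch: prefetch step s+1's KV block while step s's attention runs.

    fetch_kv(s) simulates loading offloaded KV data for step s;
    compute(s, kv) simulates the attention computation that consumes it.
    """
    buf = queue.Queue(maxsize=1)  # bounded lookahead: at most one step ahead

    def prefetcher():
        for s in steps:
            buf.put((s, fetch_kv(s)))  # blocks until the consumer catches up
        buf.put(None)                  # sentinel: no more steps

    threading.Thread(target=prefetcher, daemon=True).start()

    outs = []
    while (item := buf.get()) is not None:
        s, kv = item
        outs.append(compute(s, kv))    # overlaps with the prefetch of s + 1
    return outs
```

In a real engine the prefetch would issue asynchronous copies on a separate stream; the key property is the same: transfer latency for step s+1 hides behind the compute time of step s.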
Problem

Research questions and friction points this paper is trying to address.

long-context inference
quadratic attention complexity
KV cache memory
sparse attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse attention
asynchronous offloading
KV cache optimization
long-context inference
hierarchical attention
Yuxuan Hu
Renmin University of China
Jianchao Tan
Meituan
LLM, Automated Machine Learning, Computer Graphics, Computer Vision
Jiaqi Zhang
Meituan, Beijing, China
Wen Zan
Meituan, Beijing, China
Pingwei Sun
Meituan, Beijing, China
Yifan Lu
Meituan, Beijing, China
Yerui Sun
Meituan, Beijing, China
Yuchen Xie
Meituan, Beijing, China
Xunliang Cai
Meituan, Beijing, China
Jing Zhang
Renmin University of China
large model alignment, model compression & inference optimization, data intelligence