🤖 AI Summary
This work addresses two challenges that large language models face during long-context reasoning: the high computational complexity of attention and the substantial memory overhead of key-value (KV) caching. The authors propose a hierarchical sparse attention mechanism that integrates block-level coarse-grained filtering with token-level fine-grained selection. Complementing this, they design an asynchronous KV cache offloading engine that exploits temporal locality to overlap cache transfers with computation. The approach is compatible with both Grouped-Query Attention (GQA) and Multi-head Latent Attention (MLA) architectures. Evaluated across context lengths of 48k to 96k tokens, the method achieves accuracy on par with full attention while delivering operator-level speedups of 1.2×–10.0× and end-to-end throughput improvements of 1.3×–4.7×, significantly reducing both latency and memory consumption.
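The overlap of cache transfers with computation described above amounts to a double-buffering pipeline: while the attention kernel processes the current KV block, the next block is fetched in the background. A minimal sketch of this pattern, with `fetch_block` and `attend_block` as hypothetical stand-ins for the actual transfer and compute steps (neither name comes from the paper):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_block(i):
    # Stand-in for copying block i of the offloaded KV cache back to the device.
    return [i] * 4

def attend_block(block):
    # Stand-in for the attention computation over one KV block.
    return sum(block)

def pipelined_attention(n_blocks):
    """Double-buffered loop: transfer of block i+1 overlaps compute on block i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch_block, 0)          # start the first transfer
        for i in range(n_blocks):
            block = pending.result()                 # wait for block i to arrive
            if i + 1 < n_blocks:
                pending = io.submit(fetch_block, i + 1)  # prefetch the next block...
            results.append(attend_block(block))          # ...while computing this one
    return results

out = pipelined_attention(3)  # → [0, 4, 8]
```

In a real system the transfer would be an asynchronous host-to-device copy on a dedicated stream rather than a Python thread, but the scheduling logic is the same: the fetch for the next block is issued before the compute for the current one begins, so transfer latency is hidden whenever compute time dominates.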
📝 Abstract
Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation by exploiting temporal locality. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2×–10.0× operator speedups and 1.3×–4.7× end-to-end throughput improvements on 48k–96k contexts.
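The hierarchical selection can be pictured as a two-stage top-k: first score whole KV blocks cheaply (e.g. against a pooled key per block) and keep only the best blocks, then score individual tokens inside the survivors. A small NumPy sketch of that idea, assuming mean-pooled block keys as the coarse score (the paper's exact scoring functions, block size, and budgets are not given here, so all of those are illustrative):

```python
import numpy as np

def hierarchical_sparse_select(q, K, block_size=4, top_blocks=2, top_tokens=4):
    """Stage 1: coarse block filter; Stage 2: fine token selection.

    q: (d,) query vector; K: (n, d) cached keys, n divisible by block_size.
    Returns sorted indices of the tokens the query would attend to.
    """
    n, d = K.shape
    n_blocks = n // block_size
    # Stage 1: score each block via its mean-pooled key (cheap coarse filter).
    pooled = K.reshape(n_blocks, block_size, d).mean(axis=1)     # (n_blocks, d)
    block_scores = pooled @ q
    keep = np.argsort(block_scores)[-top_blocks:]                # best blocks
    # Stage 2: exact token scores, computed only inside surviving blocks.
    cand = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in keep]
    )
    token_scores = K[cand] @ q
    sel = cand[np.argsort(token_scores)[-top_tokens:]]
    return np.sort(sel)

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
idx = hierarchical_sparse_select(q, K)   # token indices to attend to
```

The efficiency argument is visible in the sketch: the full token-score matrix is never built; stage 1 touches one pooled vector per block, and stage 2 touches only `top_blocks * block_size` keys, while stage 2's exact per-token scores recover the precision that a purely block-level method gives up.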