FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

📅 2025-02-28
📈 Citations: 1
Influential: 1
🤖 AI Summary
To address the quadratic computational complexity of attention during the prefill phase of long-sequence inference in large language models (LLMs), this paper proposes FlexPrefill, a dynamic head-adaptive sparse attention mechanism. Unlike fixed or sample-static sparsity patterns, the method introduces query-aware sparse pattern switching and cumulative-attention-driven index selection, using Jensen–Shannon divergence to measure attention distribution similarity. This enables context-aware, per-head optimization of both the sparsity structure and the sparsity ratio at inference time. Experiments on long-sequence tasks show significantly improved throughput, up to 1.8× faster than mainstream sparse baselines, while preserving near-full-attention accuracy, with average degradation below 0.5%. The method thus achieves a favorable trade-off between efficiency and performance without compromising model fidelity.

📝 Abstract
Large language models (LLMs) encounter computational challenges during long-sequence inference, especially in the attention pre-filling phase, where the complexity grows quadratically with the prompt length. Previous efforts to mitigate these challenges have relied on fixed sparse attention patterns or identifying sparse attention patterns based on limited cases. However, these methods lacked the flexibility to efficiently adapt to varying input demands. In this paper, we introduce FlexPrefill, a Flexible sparse Pre-filling mechanism that dynamically adjusts sparse attention patterns and computational budget in real-time to meet the specific requirements of each input and attention head. The flexibility of our method is demonstrated through two key innovations: 1) Query-Aware Sparse Pattern Determination: By measuring Jensen-Shannon divergence, this component adaptively switches between query-specific diverse attention patterns and predefined attention patterns. 2) Cumulative-Attention Based Index Selection: This component dynamically selects query-key indexes to be computed based on different attention patterns, ensuring the sum of attention scores meets a predefined threshold. FlexPrefill adaptively optimizes the sparse pattern and sparse ratio of each attention head based on the prompt, enhancing efficiency in long-sequence inference tasks. Experimental results show significant improvements in both speed and accuracy over prior methods, providing a more flexible and efficient solution for LLM inference.
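To make the first innovation concrete, here is a minimal NumPy sketch of query-aware sparse pattern determination: attention distributions for a block of queries are compared against a pooled representative distribution via Jensen–Shannon divergence, and the head falls back to a cheap predefined pattern when the distributions are similar. The pooling strategy, the threshold `tau`, and the pattern names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions (in nats).
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log(a + eps) - np.log(b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def choose_pattern(q_block, keys, tau=0.1):
    # Hypothetical decision rule: compute softmax attention for a block of
    # queries, compare each row against the pooled (mean) distribution, and
    # switch to a query-specific pattern only when rows diverge noticeably.
    scores = q_block @ keys.T / np.sqrt(q_block.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    pooled = attn.mean(axis=0)  # representative distribution for this head
    div = np.mean([js_divergence(row, pooled) for row in attn])
    return "query_specific" if div > tau else "predefined"
```

In this sketch, a low average divergence means the head's queries attend similarly, so a single predefined sparse pattern suffices; a high divergence triggers per-query index estimation instead.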
Problem

Research questions and friction points this paper is trying to address.

Attention in the prefill phase scales quadratically with prompt length, making long-sequence inference costly.
Fixed or sample-static sparse attention patterns cannot adapt to varying inputs and attention heads.
Existing sparse methods sacrifice either speed or accuracy during attention pre-filling.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic sparse attention pattern adjustment
Query-aware sparse pattern determination
Cumulative-attention based index selection