AI Summary
This work addresses the quadratic computational bottleneck of standard attention mechanisms in long-context scenarios, as well as the limitations of existing hybrid attention approaches, whose static allocation ratios and head-level sparsity struggle to balance dynamic task requirements with hardware efficiency. The authors propose a context-aware, hierarchical dynamic hybrid attention framework that employs a parameter-efficient, lightweight Layer Router to adaptively select between full and sparse attention for each layer, without fine-tuning the pretrained large language model. This approach introduces the first layer-wise dynamic routing mechanism, balancing information fidelity with memory-access continuity while avoiding the hardware synchronization overhead induced by head-level sparsity. Experiments demonstrate significant performance gains over baselines on multiple long-context and mathematical reasoning benchmarks, with speedups of up to 2.8× during prefill and 2.0× during decode, while training requires only eight A800 GPUs for 12 hours.
Abstract
The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long-tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8$\times$A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that Flux Attention achieves a superior trade-off between performance and inference speed compared with baseline models, with speed improvements of up to $2.8\times$ and $2.0\times$ in the prefill and decode stages, respectively.
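The core mechanism described above, a lightweight router that inspects the incoming context and sends each frozen layer down either the full-attention or sparse-attention path, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the mean pooling, the single linear scoring head, and the 0.5 threshold are all assumptions made for the sketch, and the actual Layer Router architecture and training objective are not specified here.

```python
import numpy as np

def layer_router(hidden_states, w, b, threshold=0.5):
    """Hypothetical sketch of a per-layer FA/SA routing decision.

    hidden_states: (seq_len, d_model) activations entering a layer.
    w, b: the router's parameters -- in the framework described above,
          these are the only trainable weights; the LLM itself stays frozen.
    Returns "full" or "sparse" for this layer on this input.
    """
    # Summarize the context as one feature vector (mean pooling is an assumption).
    ctx = hidden_states.mean(axis=0)                   # (d_model,)
    # Score the context and squash to (0, 1) with a sigmoid.
    score = 1.0 / (1.0 + np.exp(-(ctx @ w + b)))
    # Route the whole layer, so attention kernels stay contiguous per layer.
    return "full" if score > threshold else "sparse"

rng = np.random.default_rng(0)
seq_len, d_model = 128, 16
h = rng.standard_normal((seq_len, d_model))
w = rng.standard_normal(d_model)
decision = layer_router(h, w, b=0.0)
print(decision)
```

Because the decision granularity is an entire layer rather than individual heads, every device executes the same kernel for that layer, which is what avoids the head-level load imbalance and synchronization long-tails the abstract mentions.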