🤖 AI Summary
The attention mechanism in large Transformer models incurs quadratic computational and memory overhead in sequence length, posing a critical bottleneck for long-context training. Existing efforts focus either on operator-level optimizations (e.g., sparse/dense attention kernels) or module-level techniques (e.g., context parallelism), yet a systematic, cross-framework, multi-dimensional evaluation is still lacking. Method: We introduce the first unified benchmark for long-context attention, enabling reproducible, comprehensive evaluation across two key dimensions (attention mask patterns and distributed scale) while integrating kernel-level optimizations and context parallelism strategies. The framework supports modular extensibility and incorporates state-of-the-art attention kernels and parallel mechanisms. Experiments are conducted at scale on up to 96 GPUs. Contribution/Results: Our empirical analysis reveals the efficiency, scalability, and deployment trade-offs of diverse approaches under extreme long-context regimes, providing actionable, evidence-based guidance for designing practical large-model training systems.
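The quadratic overhead mentioned above comes from the attention score matrix itself, which holds one entry per query-key pair. A minimal NumPy sketch (illustrative only, not the benchmark's code) makes the scaling explicit:

```python
# Illustrative sketch (not from the paper): why attention cost is
# quadratic in sequence length. The score matrix alone has n*n entries.
import numpy as np

def attention(q, k, v):
    # q, k, v: (n, d). Scores are (n, n) -> O(n^2) time and memory.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 512, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(q, k, v)
print(out.shape)  # (512, 64)
# Doubling n quadruples the score matrix: (2n)^2 = 4 * n^2.
print((2 * n) ** 2 // n ** 2)  # 4
```

Kernel-level optimizations (e.g., fused or sparse attention operators) attack the constant factors and memory traffic of this computation; module-level strategies shard it across devices.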
📝 Abstract
Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard attention mechanism incurs quadratic computation and memory costs with respect to sequence length, posing a major bottleneck for long-context training. Prior work tackles this challenge along two directions: (1) kernel-level optimizations, which accelerate dense and sparse attention operators; and (2) module-level strategies, often referred to as distributed attention or context-parallel training, which scale attention across multiple devices. However, systematic evaluation remains limited: operator-level comparisons are often incomplete, and context-parallel strategies are typically framework-specific, with unclear performance characteristics across settings. To address these gaps, we propose a unified benchmark that integrates representative attention kernels and context-parallel mechanisms behind a modular, extensible evaluation interface. The benchmark evaluates methods along two critical dimensions: (1) attention mask patterns, which strongly affect efficiency, scalability, and usability; and (2) sequence length and distributed scale, which determine performance under extreme long-context training. Through comprehensive experiments on clusters of up to 96 GPUs, our benchmark enables reproducible comparisons, highlights method-specific trade-offs, and provides practical guidance for designing and deploying attention mechanisms in long-context LLM training.
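The core idea of context parallelism can be sketched in a few lines. This is a simplified single-process simulation (an assumption for illustration, not the paper's implementation): queries are sharded across simulated "devices", each shard attends against the full K/V, and the partial outputs are concatenated. Real systems such as ring attention pipeline the K/V exchange instead of replicating it, but the numerical result is the same.

```python
# Illustrative sketch (assumption, not the paper's method): context
# parallelism shards the sequence across devices so each holds only
# n/p query rows, then the partial outputs are stitched back together.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Full (n, n) attention on a single "device".
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def context_parallel_attention(q, k, v, num_devices):
    # Each simulated device computes attention for its query shard
    # against the full K/V; communicating K/V efficiently is what
    # ring/all-gather schemes optimize in real distributed systems.
    shards = np.array_split(q, num_devices)
    return np.concatenate([attention(s, k, v) for s in shards])

n, d = 128, 32
rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
full = attention(q, k, v)
sharded = context_parallel_attention(q, k, v, num_devices=4)
print(np.allclose(full, sharded))  # True
```

Because softmax rows are independent, sharding queries is exact; the trade-offs the benchmark measures come from how K/V movement, masking, and load balance are handled at scale.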