HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the quadratic (O(n²)) computational cost of attention and the substantial memory consumption of long-sequence Transformer training, both of which degrade pipeline parallelism efficiency, this paper proposes HelixPipe. It introduces an attention parallel partition, the first scheme to decouple sequence-wise and model-wise parallelization, which schedules the attention computations of different micro batches across pipeline stages in parallel to reduce pipeline bubbles. A two-fold first-in-last-out (FILO) micro-batch schedule balances memory usage and overlaps communication with computation. HelixPipe further integrates recomputation without attention and chunked (block-wise) MLP computation to lower peak memory usage and mitigate fragmentation. Evaluated on a 64-GPU H20 cluster training a 7B-parameter model on 128K-token sequences, HelixPipe achieves a 26% throughput improvement over baseline methods and scales well across varying pipeline sizes, model sizes, and cluster configurations, demonstrating its practicality for large-scale, long-context training.
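The chunked (block-wise) MLP mentioned above is a general memory-saving pattern: because the MLP treats each token independently, the sequence can be processed in chunks without changing the result. Below is a minimal PyTorch sketch of that idea; the module names, shapes, and chunk size are illustrative assumptions, not HelixPipe's actual implementation.

```python
import torch
import torch.nn as nn

class ChunkedMLP(nn.Module):
    """Transformer MLP applied over the sequence in chunks to cap peak activation memory."""

    def __init__(self, hidden: int, ffn: int, chunk_size: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(hidden, ffn)
        self.fc2 = nn.Linear(ffn, hidden)
        self.act = nn.GELU()
        self.chunk_size = chunk_size  # illustrative default, not taken from the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq, batch, hidden]. Each token is independent in the MLP, so
        # processing the sequence chunk by chunk is exact, not an approximation;
        # only one chunk's intermediate [chunk, batch, ffn] activation is live at a time.
        outputs = []
        for chunk in torch.split(x, self.chunk_size, dim=0):
            outputs.append(self.fc2(self.act(self.fc1(chunk))))
        return torch.cat(outputs, dim=0)
```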

📝 Abstract
As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel pipeline parallelism for long sequence transformer training. First, HelixPipe introduces attention parallel partition, which schedules attention computations of different micro batches across different pipeline stages in parallel, reducing pipeline bubbles. Second, it employs a two-fold first-in-last-out micro batch schedule to balance memory usage and overlap communication with computation. Additionally, HelixPipe utilizes recomputation without attention and chunked MLP to mitigate fragmentation and enable longer sequences. Experiments demonstrate that HelixPipe gains increasing advantages with longer sequence lengths, and outperforms existing methods in throughput and scalability across varying pipeline sizes, model sizes, and cluster configurations. Notably, it achieves a 26% speedup over baseline methods when training a 7B model with 128k sequence length on 64 H20 GPUs. Code is available at https://github.com/code-tunnel/Megatron-LM/tree/dev.
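"Recomputation without attention" suggests selective activation checkpointing that keeps the expensive attention output and recomputes only the cheaper sublayers during the backward pass. The sketch below is a hypothetical PyTorch rendering of that idea under stated assumptions; `attn`, `norm1`, `norm2`, and `mlp` are assumed module names, not the paper's code.

```python
import torch
from torch.utils.checkpoint import checkpoint

def layer_forward_without_attention_recompute(layer, x: torch.Tensor) -> torch.Tensor:
    # Run attention normally so its output activation is stored and the O(n^2)
    # attention is never recomputed in the backward pass.
    attn_out = x + layer.attn(layer.norm1(x))

    # Checkpoint only the cheaper non-attention part (norm + MLP); it is
    # recomputed from attn_out during backward, saving its activations.
    def non_attention_part(h: torch.Tensor) -> torch.Tensor:
        return h + layer.mlp(layer.norm2(h))

    return checkpoint(non_attention_part, attn_out, use_reentrant=False)
```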
Problem

Research questions and friction points this paper is trying to address.

Efficient distributed training of long-sequence transformers
Reducing the pipeline bubbles caused by quadratic attention computation
Balancing memory usage with computation-communication overlap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention parallel partition schedules attention across pipeline stages to reduce bubbles
Two-fold first-in-last-out micro-batch schedule balances memory usage and overlaps communication with computation (see the sketch after this list)
Recomputation without attention and chunked MLP lower peak memory, enabling longer sequences
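As a toy illustration of the first-in-last-out ordering behind the schedule, the snippet below contrasts it with a first-in-first-out ordering of micro-batch backward passes. It only shows the ordering idea; the paper's two-fold, stage-aware schedule is more involved.

```python
def backward_orders(num_microbatches: int = 4):
    # Micro batches enter a pipeline stage in forward order 0, 1, 2, ...
    forward_order = list(range(num_microbatches))
    # First-in-first-out: the earliest micro batch is backpropagated first.
    fifo = list(forward_order)
    # First-in-last-out: the earliest micro batch is backpropagated last,
    # which changes how long each micro batch's activations must be held.
    filo = list(reversed(forward_order))
    return fifo, filo

print(backward_orders())  # ([0, 1, 2, 3], [3, 2, 1, 0])
```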