LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid

📅 2025-02-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses the high communication overhead, low computational parallelism, and poor scalability of existing sequence parallelism (SP) methods for linear attention models when training on ultra-long sequences (e.g., 2048K tokens) in distributed systems. The authors propose LASP-2, a novel SP method designed specifically for linear attention. Its core contribution is the first formal characterization of the minimal communication requirement under the right-product-first property of linear attention, which enables a single AllGather whose communication volume is independent of sequence length. The authors further introduce LASP-2H, the first SP extension supporting hybrid architectures that mix linear and standard attention layers. Evaluated on 64 GPUs at a 2048K sequence length, LASP-2 achieves a 15.2% speedup over LASP and 36.6% over Ring Attention. The implementation is open-sourced as part of the Linear-MoE project.

πŸ“ Abstract
Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method that enhances both communication and computation parallelism when training linear attention transformer models on very long input sequences. Compared to the previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication–computation workflow of LASP. In this way, only a single AllGather collective communication is needed, on intermediate memory states whose sizes are independent of the sequence length, leading to significant improvements in both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying a similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention at a sequence length of 2048K across 64 GPUs. The code is released as a part of: https://github.com/OpenSparseLLMs/Linear-MoE.
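The key idea above — that causal linear attention only needs to communicate small per-chunk memory states (size d × d, independent of sequence length) via one AllGather — can be illustrated with a minimal single-process sketch. This is not the LASP-2 implementation (which lives in the Linear-MoE repo and uses real multi-GPU collectives); it simulates the devices with array chunks, and all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, num_devices = 16, 4, 4  # toy sequence length, head dim, "devices" (hypothetical sizes)
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# Reference: full causal linear attention, O(T^2) memory, for checking only.
mask = np.tril(np.ones((T, T)))
O_ref = (mask * (Q @ K.T)) @ V

# LASP-2-style workflow: each "device" holds one sequence chunk and computes
# a local d x d memory state; only these states are communicated.
chunks = np.split(np.arange(T), num_devices)
states = [K[c].T @ V[c] for c in chunks]  # per-chunk states, size d*d each

# One AllGather: afterwards every device holds all states. Note the
# communicated volume (num_devices * d * d) does not grow with T.
gathered = list(states)

O = np.empty_like(V)
prefix = np.zeros((d, d))  # running sum of states from earlier chunks
for t, c in enumerate(chunks):
    inter = Q[c] @ prefix                 # contribution of all earlier chunks
    scores = np.tril(Q[c] @ K[c].T)       # causal intra-chunk attention
    O[c] = inter + scores @ V[c]
    prefix += gathered[t]

assert np.allclose(O, O_ref)
```

In the real distributed setting each device runs only its own loop iteration; the prefix sum over gathered states is what lets computation on all chunks proceed in parallel after the single collective, rather than serially as in ring-style SP.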
Problem

Research questions and friction points this paper is trying to address.

Enhances sequence parallelism for linear attention models.
Improves communication and computation parallelism in training.
Extends optimization to hybrid linear and standard attention models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes communication for linear attention.
Reduces communication to a single AllGather.
Extends SP to hybrid attention models.