Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference

📅 2026-04-29
📈 Citations: 0
Influential: 0
📄 PDF

career value

188K/year
🤖 AI Summary
This work addresses the GPU memory bottleneck in Transformer models caused by high parameter and activation memory demands during training and inference. The authors propose a novel parallelism strategy that integrates tensor parallelism (TP) and sequence parallelism (SP) along the same device axis, enabling each device to simultaneously shard both model weights and input sequences. By leveraging broadcast-based weight sharding with key-value exchange in attention layers and ring-based weight passing with local accumulation in gated MLPs, the method achieves dual compression of both parameter and activation memory. This approach significantly reduces per-device memory consumption, outperforming conventional TP, SP, and their hybrid variants. It demonstrates superior hardware adaptability and scaling efficiency under long-context and memory-constrained settings, while seamlessly integrating with pipeline and expert parallelism.
📝 Abstract
We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single device axis. In conventional multi-dimensional parallelism layouts, tensor parallelism (TP) shards model weights while sequence parallelism (SP) shards tokens, reducing per-device parameter or activation memory, respectively. Traditionally, each scheme is assigned its own mesh dimension. TSP instead assigns each rank both a weight shard and a sequence shard, reducing both parameter and activation memory along the same device axis. We implement this design with two runtime schedules. For attention, ranks iterate over broadcast parameter shards and reconstruct context through a sequence-wise key/value exchange. For gated MLPs, weight shards circulate in a ring while partial outputs accumulate locally. By sharding both weights and activations across the same devices, TSP trades additional communication volume for reduced memory overhead. We provide a theoretical communication and memory analysis, describe our implementation of TSP attention and gated MLP blocks, and benchmark TSP against TP, SP, and TP+SP. These results position TSP as a hardware-aware alternative for long-context and memory-constrained model training, and as a viable axis of parallelism in concert with existing parallelism schemes such as pipeline and expert parallelism for dense and mixture-of-expert models.
Problem

Research questions and friction points this paper is trying to address.

memory-efficient
transformer
tensor parallelism
sequence parallelism
model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

tensor parallelism
sequence parallelism
memory-efficient training
Transformer optimization
distributed deep learning
🔎 Similar Papers
No similar papers found.