Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing context parallelism approaches handle varying training sequence lengths poorly, which leads to low compute efficiency, high communication overhead, and load imbalance. This work proposes FCP, a flexible context parallelism paradigm that moves beyond conventional ring-based communication through block-level sequence partitioning, dynamic bin-packing for hybrid sequence scheduling, arbitrary-topology point-to-point communication, and optimized distributed attention mechanisms. Together, these allow short and long sequences to be scheduled in a unified way with balanced workload distribution. Evaluated on up to 256 NVIDIA GPUs, FCP shows near-linear scalability and a 1.13×–2.21× improvement in Model FLOPs Utilization (MFU) for the attention module.
📝 Abstract
Context parallelism (CP) has been widely adopted to support the growing context length in foundation model pretraining. However, existing designs fail to handle the large variation in sequence length from training datasets, resulting in suboptimal performance. These methods often over-shard short sequences, leading to compute inefficiency and excessive communication, or process long and short sequences separately without proper bin-packing, causing workload imbalance. In this paper, we propose FCP, a flexible context parallelism paradigm that shards and schedules sequences at block-level granularity. Instead of relying on rigid communication topologies such as ring, FCP enables arbitrary peer-to-peer communication, allowing flexible placement of sequence blocks across workers. By bin-packing blocks from both short and long sequences, FCP achieves both high compute efficiency and balanced workload distribution. Extensive evaluations show that FCP attains near-linear scalability on up to 256 NVIDIA GPUs, with 1.13x-2.21x improvement in the attention MFU.
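The block-level sharding and bin-packing idea described in the abstract can be illustrated with a minimal sketch. This is not FCP's actual scheduler: the names `Block`, `split_into_blocks`, and `pack_blocks`, the fixed block size, the causal-attention cost estimate, and the greedy "heaviest block to the least-loaded worker" heuristic are all assumptions chosen for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    seq_id: int   # sequence the block was cut from
    index: int    # position of the block within that sequence
    tokens: int   # number of tokens in the block
    cost: int     # assumed attention work: query tokens x visible key tokens


def split_into_blocks(seq_lens, block_size):
    """Shard every sequence into blocks of at most `block_size` tokens."""
    blocks = []
    for seq_id, length in enumerate(seq_lens):
        for index, start in enumerate(range(0, length, block_size)):
            tokens = min(block_size, length - start)
            # Assumed cost model: under a causal mask, queries in this block
            # see keys up to the end of the block. This is an upper bound that
            # ignores the triangular mask inside the block itself.
            blocks.append(Block(seq_id, index, tokens, tokens * (start + tokens)))
    return blocks


def pack_blocks(blocks, num_workers):
    """Greedy packing: place the heaviest remaining block on the least-loaded worker."""
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(num_workers)}
    for block in sorted(blocks, key=lambda b: b.cost, reverse=True):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(block)
        heapq.heappush(heap, (load + block.cost, worker))
    return assignment


def worker_loads(assignment):
    """Total assumed attention cost per worker."""
    return {worker: sum(b.cost for b in blks) for worker, blks in assignment.items()}


if __name__ == "__main__":
    # One long sequence mixed with several short ones (hypothetical batch).
    blocks = split_into_blocks([8192, 1024, 512, 512], block_size=512)
    for worker, load in worker_loads(pack_blocks(blocks, num_workers=4)).items():
        print(f"worker {worker}: cost {load}")
```

Because blocks from one sequence may land on any worker, a real system would also need to exchange keys and values between arbitrary pairs of workers, which is the role the abstract assigns to peer-to-peer communication. The sketch models only placement, not communication.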
Problem

Research questions and friction points this paper is trying to address.

context parallelism
sequence length variation
workload imbalance
compute inefficiency
foundation model pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

context parallelism
block-level scheduling
bin-packing
peer-to-peer communication
scalable pretraining