🤖 AI Summary
To address the communication redundancy and computational load imbalance caused by sequence-length heterogeneity in long-context LLM training, this paper proposes ByteScale, a training framework built around a novel parallelism strategy, Hybrid Data Parallelism (HDP), which unifies inter-data partitioning (Data Parallelism) and intra-data partitioning (Context Parallelism) over a dynamic mesh. ByteScale eliminates redundant communication for short sequences through data-aware sharding and dynamic communication, compresses communication cost for long sequences through selective offloading, and mitigates computational imbalance with a parallelism-aware balance scheduler, enabling efficient joint training of mixed-length sequences on ultra-large-scale clusters. Evaluated on models ranging from 7B to 141B parameters and context lengths from 256K to 2048K tokens on a production cluster of more than 12,000 GPUs, ByteScale achieves up to 7.89× higher throughput than state-of-the-art systems, demonstrating substantial gains in scalability and efficiency for long-context LLM training.
📝 Abstract
Scaling long-context ability is essential for Large Language Models (LLMs). To amortize the memory consumption across multiple devices in long-context training, inter-data partitioning (a.k.a. Data Parallelism) and intra-data partitioning (a.k.a. Context Parallelism) are commonly used. Current training frameworks predominantly treat the two techniques as orthogonal and establish static communication groups that organize the devices into a static mesh (e.g., a 2D mesh). However, the sequences used for LLM training typically vary in length, whether for text, multimodal, or reinforcement-learning data. The mismatch between this data heterogeneity and the static mesh causes redundant communication and imbalanced computation, degrading training efficiency. In this work, we introduce ByteScale, an efficient, flexible, and scalable LLM training framework for large-scale mixed training of long and short sequences. The core of ByteScale is a novel parallelism strategy, namely Hybrid Data Parallelism (HDP), which unifies inter- and intra-data partitioning with a dynamic mesh design. In particular, we build a communication optimizer, which eliminates redundant communication for short sequences through data-aware sharding and dynamic communication, and further compresses the communication cost for long sequences through selective offloading. In addition, we develop a balance scheduler that mitigates imbalanced computation through parallelism-aware data assignment. We evaluate ByteScale with model sizes ranging from 7B to 141B and context lengths from 256K to 2048K, on a production cluster with more than 12,000 GPUs. Experiment results show that ByteScale outperforms the state-of-the-art training system by up to 7.89x.
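To make the two ideas in the abstract concrete, here is a minimal sketch (not ByteScale's actual API) of how a dynamic mesh and a balance scheduler might interact. It assumes a sequence's compute cost scales as the square of its length (attention-dominated) and that context-parallel group sizes are powers of two; `cp_degree_for` and `assign_balanced` are hypothetical names introduced for illustration only.

```python
def cp_degree_for(seq_len, max_tokens_per_device):
    """Smallest power-of-two context-parallel degree that fits the sequence.
    Short sequences get degree 1, i.e. no intra-data communication at all."""
    d = 1
    while seq_len > d * max_tokens_per_device:
        d *= 2
    return d

def assign_balanced(seq_lens, num_devices, max_tokens_per_device):
    """Greedy longest-processing-time scheduling: place each sequence
    (heaviest first) on the least-loaded contiguous device group whose
    size matches the sequence's context-parallel degree."""
    loads = [0.0] * num_devices
    plan = []
    for s in sorted(seq_lens, reverse=True):
        d = cp_degree_for(s, max_tokens_per_device)
        cost = (s ** 2) / d  # per-device cost when sharded d ways
        # pick the contiguous group of d devices with the lightest load
        best = min(range(num_devices - d + 1),
                   key=lambda i: max(loads[i:i + d]))
        for i in range(best, best + d):
            loads[i] += cost
        plan.append((s, list(range(best, best + d))))
    return plan, loads

# A long sequence spans two devices; short ones stay on a single device,
# so they incur no context-parallel communication.
plan, loads = assign_balanced([256, 64, 64, 32], num_devices=4,
                              max_tokens_per_device=128)
```

The key point the sketch illustrates is that communication groups are derived per sequence from the data itself, rather than fixed once for the whole mesh, while the scheduler keeps per-device loads even.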